Abstract. We consider the problem of aggregating the elements of a possibly infinite dictionary
for building a decision procedure that aims at minimizing a given criterion. Along with the
dictionary, an independent identically distributed training sample is available, on which the
performance of a given procedure can be tested. In a fairly general set-up, we establish an oracle
inequality for the Mirror Averaging aggregate with any prior distribution. By choosing an appropriate
prior, we apply this oracle inequality in the context of prediction under sparsity assumption for the
problems of regression with random design, density estimation and binary classification.



1. Introduction



In recent years several methods of estimation and selection under the sparsity scenario have
been discussed in the literature. The $\ell_1$-penalized least squares (Lasso) is by far the most studied
one and its statistical properties are now well understood (cf., e.g., (j. [ll],[l2 , [l3l.[33.l49l,[53.[57l.l53] 



and the references cited therein). Several other estimators are closely related to the Lasso, such
as the Elastic net [59], the Dantzig selector [15], the adaptive Lasso [60], the least squares with
entropy or $\ell_{1+\delta}$ penalization [33], [34], etc. These estimators are obtained as solutions of convex
or linear programming problems and are attractive because of their low computational cost. However,
they have good theoretical properties only under rather restrictive assumptions, such as the
mutual coherence assumption [2J], the uniform uncertainty/restricted isometry principle [15], the
irrepresentable EH] or the restricted eigenvalue 0] conditions. Roughly speaking, these conditions
mean that, for example, in the linear regression context one should assume that the Gram
matrix of the predictors is not too far from the identity matrix. This type of assumption is natural
if we want to identify the parameters or to retrieve the sparsity pattern, but it is not necessary if
we are interested only in the prediction ability.

Indeed, at least in theory, there exist estimators attaining sufficiently good accuracy of prediction
under almost no assumption on the Gram matrix. This is, in particular, the case for the
$\ell_0$-penalized least squares estimator [10, Thm. 3.6], [12, Thm. 3.1]. However, in practice this
estimator can be unstable (cf. [8]). Furthermore, its computation is an NP-hard problem, and
there is a challenge to find a method realizing a compromise between the theoretical optimality
and computational efficiency. Motivated by this, we proposed in [20, 21, 22, 23] an approach to
estimation under the sparsity scenario, which is quite different from the $\ell_1$ penalization
techniques. The idea is to use an exponentially weighted aggregate (EWA) with a properly chosen
sparsity-favoring prior. Let us note that there exists an extensive literature on EWA, which does
not discuss the sparsity issue. Thus, procedures with exponential weighting are quite common
in the context of on-line learning with deterministic data, see [18, 32, 50], the monograph [19]
and the references cited therein. Statistical properties of various versions of EWA are discussed
in H,li[i,[ii[H[iEE3,ElEi,EilEiEEEa.



In contrast to these works, we focus in [20, 21, 22, 23] on the ability of EWA to deal with
the sparsity issue. Specifically, we prove that EWA with a properly chosen prior satisfies
sparsity oracle inequalities (SOI), which are comparable with those for the $\ell_0$-penalized techniques
and are even better in some aspects. At the same time, in contrast to the $\ell_0$-penalized
methods, our method is computationally feasible for relatively large dimensions of the problem,
cf. [23]. Furthermore, our estimator has theoretical advantages as compared to the $\ell_1$-penalized
methods, since it satisfies oracle inequalities with leading constant 1 that hold with almost no
assumption on the dictionary/Gram matrix (cf. detailed comparison with the $\ell_1$ based methods
in Section 8 below).

The results of [20, 21, 22, 23] are established for the linear regression model with fixed design.

The aim of this paper is to show that similar ideas can be successfully implemented for a large
scope of statistical problems with i.i.d. data, in particular, for regression with random design,
density estimation and classification. The procedure that we propose is mirror averaging (MA)
with sparsity priors. The difference from the EWA considered in [20], [21], [22], [23] is that we
compute the exponential weights recursively and then average them out.

This paper is organized as follows. In Section 2 we introduce some notation and formulate the main
assumptions. Section 3 contains the definition of the MA estimator and a general PAC-Bayesian
risk bound in expectation. In Section 4 we introduce our sparsity prior and obtain our main SOI
as a corollary of the PAC-Bayesian bound. Sections 5, 6 and 7 consider applications of this result
to specific models, namely, to nonparametric regression with random design, density estimation
and classification. In Section 8 we briefly discuss computational aspects of the MA aggregate and
compare it to other methods of sparse estimation. Technical proofs are given in the appendix.



2. Notation and assumptions 

Let $(\mathcal{Z}, \mathcal{A})$ be a measurable space and let $\{P_f, f \in \mathcal{F}\}$ be a collection of probability measures on
$(\mathcal{Z}, \mathcal{A})$ indexed by some set $\mathcal{F}$. We are interested in the estimation of $f$ based on an i.i.d. sample
$Z_1, \dots, Z_n$ drawn from the probability distribution $P_f$. We will assume that $f$ is a "functional"
parameter, that is, $\mathcal{F}$ is a subset of a vector space $\mathcal{E} = \{f : \mathcal{X} \to \mathbb{R}^d\}$ for some set $\mathcal{X}$ and for some
positive integer $d$. From now on, we denote by $E_f$ the expectation w.r.t. $P_f$ and by $Z$ the random
vector $(Z_1, \dots, Z_n) \in \mathcal{Z}^n$.

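To further specify the settings, let $\ell : \mathcal{E} \times \mathcal{F} \to \mathbb{R}_+$ be a general loss function. An estimator of $f$
is any mapping $\hat f : \mathcal{Z}^n \to \mathcal{E}$ such that the mapping $z \mapsto \ell(\hat f(z), f)$, defined on $(\mathcal{Z}^n, \mathcal{A}^n)$ and with
values in $\mathbb{R}_+$, is measurable for every $f \in \mathcal{F}$. The performance of an estimator $\hat f$ is quantified by
the risk

$$E_f[\ell(\hat f(Z), f)] := \int_{\mathcal{Z}^n} \ell(\hat f(z), f)\, P_f^n(dz).$$

Here $P_f^n$ stands for the product measure $P_f \otimes \dots \otimes P_f$ on $(\mathcal{Z}^n, \mathcal{A}^n)$. We will assume the following.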

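Assumption Q1: There exists a mapping $Q : \mathcal{Z} \times \mathcal{E} \to \mathbb{R}$ such that, for every $f \in \mathcal{F}$,

- the mapping $z \mapsto Q(z, g)$ is measurable and $P_f$-integrable for every $g \in \mathcal{E}$,

- $\Delta(f) = \int_{\mathcal{Z}} Q(z, g)\, P_f(dz) - \ell(g, f)$ is independent of $g$ and finite for any $f \in \mathcal{F}$.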
Assumption Q1 is fulfilled in a number of settings; a detailed discussion is given in Sections 5-7.
For example, in the case of regression with squared loss, one has $z = (x, y) \in \mathcal{Z} = \mathcal{X} \times \mathbb{R}$ and
$\ell(g, f) = \int_{\mathcal{X}} (g - f)^2\, dP_X$, where $P_X$ stands for the distribution of the design and $f$ is the regression
function. Assumption Q1 is then fulfilled with $Q(z, g) = (y - g(x))^2$. In simple words, Assumption
Q1 requires the existence of an unbiased estimator of the risk $\ell(g, f)$, up to a summand depending
exclusively on $f$, where $f$ is the unknown parameter and $g$ is a known function. It is worth
noting that under Assumption Q1 the minimizer of the loss function $g \mapsto \ell(g, f)$ coincides with
the minimizer of the expectation $g \mapsto \int Q(z, g)\, P_f(dz)$. This property is crucial in what follows.
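For the regression example, the claim that Assumption Q1 holds can be verified in one line. The following worked computation is a sketch that assumes the model $Y = f(X) + \xi$ with $E[\xi \mid X] = 0$ and $E[\xi^2] < \infty$, as in Section 5; the symbol $\xi$ for the noise is borrowed from that section.

```latex
\begin{align*}
\int_{\mathcal{Z}} Q(z,g)\,P_f(dz)
  &= E\big[(Y-g(X))^2\big]
   = E\big[(f(X)-g(X)+\xi)^2\big] \\
  &= E\big[(f(X)-g(X))^2\big] + 2\,E\big[(f(X)-g(X))\,E[\xi\mid X]\big] + E[\xi^2] \\
  &= \ell(g,f) + E[\xi^2].
\end{align*}
% Hence \Delta(f) = E[\xi^2], which does not depend on g.
```

In particular, minimizing $g \mapsto \int Q(z,g)\,P_f(dz)$ and minimizing $g \mapsto \ell(g,f)$ are the same problem, since the two criteria differ by a constant.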

Since, in general, there is no estimator having the smallest possible risk among all possible
estimators, we will pursue a more realistic goal, which consists in finding an estimator whose risk,
for every $f$, is nearly as small as the minimal risk $\min_{g \in \mathcal{F}_\Lambda} \ell(g, f)$ over a pre-specified subset $\mathcal{F}_\Lambda$
of $\mathcal{E}$, i.e., we will follow the oracle approach. To make this approach sensible, the subfamily $\mathcal{F}_\Lambda$
should not be too large. On the other hand, it should be chosen large enough to contain a good
approximation to the (unknown) "true" function $f$.

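The set $\mathcal{F}_\Lambda$ is indexed by the elements of some measurable space $(\Lambda, \mathcal{L})$. More precisely, we
define $\mathcal{F}_\Lambda = \{f_\lambda, \lambda \in \Lambda\} \subset \mathcal{E}$ as a collection of functions (dictionary) such that, for every $x \in \mathcal{X}$
and $z \in \mathcal{Z}$, the mappings $\lambda \mapsto f_\lambda(x)$, $\lambda \mapsto Q(z, f_\lambda)$ and $\lambda \mapsto \ell(f_\lambda, f)$ from $\Lambda$ to $\mathbb{R}$ are measurable.
The elements of the dictionary $\mathcal{F}_\Lambda$ can be interpreted as candidate estimators of $f$. Define $\mathcal{P}_\Lambda$
as the set of all probability measures on $(\Lambda, \mathcal{L})$ and $\mathcal{P}_\Lambda(\mathcal{F}_\Lambda)$ as the set of all measures $\mu \in \mathcal{P}_\Lambda$ such
that $\int_\Lambda |f_\lambda(x)|\, \mu(d\lambda) < \infty$ for every $x \in \mathcal{X}$. We define for every $\mu \in \mathcal{P}_\Lambda(\mathcal{F}_\Lambda)$

$$f_\mu(x) = \int_\Lambda f_\lambda(x)\, \mu(d\lambda), \qquad x \in \mathcal{X}.$$

We say that $f_\mu$ is a convex aggregate of functions $f_\lambda$ with $\mu$ being the mixing measure or the
measure of aggregation. The estimators we study in the present work are convex aggregates with
data-dependent mixing measures.

In what follows, we denote by $\mathcal{C}(\mathcal{F}_\Lambda)$ the set of all convex aggregates of functions $f_\lambda$, that is

$$\mathcal{C}(\mathcal{F}_\Lambda) = \{g : \mathcal{X} \to \mathbb{R} \ \text{s.t.}\ g = f_\mu \ \text{for some}\ \mu \in \mathcal{P}_\Lambda(\mathcal{F}_\Lambda)\}.$$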

It is clear that $\mathcal{C}(\mathcal{F}_\Lambda)$ is a convex set containing $\mathcal{F}_\Lambda$. For our main result we need the following
condition on the function $Q$ appearing in Assumption Q1.

Assumption Q2: There exist $\beta > 0$ and a mapping $\Psi_\beta : \mathcal{C}(\mathcal{F}_\Lambda) \times \mathcal{C}(\mathcal{F}_\Lambda) \to \mathbb{R}_+$ such that

i) $\Psi_\beta(g, g) = 1$ for all $g \in \mathcal{C}(\mathcal{F}_\Lambda)$,

ii) the mapping $g \mapsto \Psi_\beta(g, \tilde g)$ is concave on $\mathcal{C}(\mathcal{F}_\Lambda)$ for every fixed $\tilde g \in \mathcal{C}(\mathcal{F}_\Lambda)$,

iii) the inequality

$$\int_{\mathcal{Z}} \exp\big(-\beta^{-1}\{Q(z, g) - Q(z, \tilde g)\}\big)\, P_f(dz) \le \Psi_\beta(g, \tilde g)$$

holds for every $g, \tilde g \in \mathcal{C}(\mathcal{F}_\Lambda)$.

At first sight, this assumption seems cumbersome but we will show that it holds for a number
of settings which are of central interest in nonparametric statistics. For example, in the model
of regression with random design and additive Gaussian noise, Assumption Q2 is fulfilled for
$\beta \ge 2\sigma^2 + 2\sup_\lambda \|f_\lambda - f\|_\infty^2$, where $\sigma^2$ is the noise variance and $f$ is the unknown regression function.
Assumption Q2 was first introduced in [30, Theorem 4.2] for finite dictionaries and a variant
of it has been used in [3, Corollary 5.1].



4 



DALALYAN AND TSYBAKOV 



Note also that if Assumption Q2 is satisfied for some $(\beta, \Psi_\beta)$, then it is so for $(\beta', \Psi_\beta^{\beta/\beta'})$ with any
$\beta' > \beta$. In fact, condition ii) is ensured due to the concavity of the function $t \mapsto t^{\beta/\beta'}$ on $[0, \infty)$,
while iii) can be checked using the Hölder inequality.
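The Hölder step can be spelled out as follows. This is a sketch under the notation of Assumption Q2, writing $D(z) = Q(z,g) - Q(z,\tilde g)$ as a shorthand introduced only for this computation, and using that the exponent $\beta/\beta'$ lies in $(0,1)$.

```latex
\begin{align*}
\int_{\mathcal{Z}} \exp\big(-D(z)/\beta'\big)\,P_f(dz)
  &= \int_{\mathcal{Z}} \Big(\exp\big(-D(z)/\beta\big)\Big)^{\beta/\beta'} P_f(dz) \\
  &\le \Big(\int_{\mathcal{Z}} \exp\big(-D(z)/\beta\big)\,P_f(dz)\Big)^{\beta/\beta'}
   \le \Psi_\beta(g,\tilde g)^{\beta/\beta'} .
\end{align*}
% First inequality: Hoelder (equivalently Jensen) for the concave map t -> t^{beta/beta'};
% second inequality: condition iii) for (beta, Psi_beta) and monotonicity of t -> t^{beta/beta'}.
```

Condition i) is immediate because $1^{\beta/\beta'} = 1$.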



3. Mirror averaging and a PAC-Bayesian bound in expectation 

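We now introduce the mirror averaging (MA) estimator. First, we fix a prior $\pi \in \mathcal{P}_\Lambda(\mathcal{F}_\Lambda)$, a
"temperature" parameter $\beta > 0$, and set

$$\theta_{m,\lambda}(Z) = \frac{\exp\{-\frac{1}{\beta}\sum_{i=1}^{m} Q(Z_i, f_\lambda)\}}{\int_\Lambda \exp\{-\frac{1}{\beta}\sum_{i=1}^{m} Q(Z_i, f_w)\}\,\pi(dw)},
\qquad
\theta_\lambda = \theta_\lambda(Z) = \frac{1}{n+1}\sum_{m=0}^{n} \theta_{m,\lambda}(Z)$$

with $\theta_{0,\lambda}(Z) \equiv 1$. For every fixed $Z$, $\theta_\lambda$ is a probability density on $\Lambda$ with respect to the
probability measure $\pi$. Let $\hat\mu_n$ be the probability measure on $(\Lambda, \mathcal{L})$ having $\theta_\lambda$ as density w.r.t. $\pi$. By
analogy with the Bayesian context, one can call $\theta_\lambda$ and $\hat\mu_n$ the posterior density and the
posterior probability, respectively. Following [30] where the case of discrete $\pi$ was considered (see
also [41]), we define the MA aggregate as the corresponding posterior mean $\hat f_n = f_{\hat\mu_n}$, that is
$\hat\mu_n(d\lambda) = \frac{1}{n+1}\sum_{m=0}^{n} \theta_{m,\lambda}(Z)\,\pi(d\lambda)$ and

$$\hat f_n(Z, x) = \int_\Lambda f_\lambda(x)\,\theta_\lambda(Z)\,\pi(d\lambda) = \frac{1}{n+1}\sum_{m=0}^{n} \int_\Lambda f_\lambda(x)\,\theta_{m,\lambda}(Z)\,\pi(d\lambda). \qquad (1)$$

To simplify the notation we suppress the dependence of $\hat f_n$ on $Z$ and $x$ when it causes no
ambiguity.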
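For a finite dictionary, the integrals with respect to the prior reduce to sums, and the recursion behind (1) can be sketched in a few lines. The following Python code is a minimal illustration, not the authors' implementation: the function name `mirror_averaging_weights` and the input layout (a table `losses[i][j]` holding $Q(Z_{i+1}, f_j)$) are assumptions made for this example.

```python
import math

def mirror_averaging_weights(losses, prior, beta):
    """Mirror-averaging weights for a finite dictionary (illustrative sketch).

    losses[i][j] is assumed to hold Q(Z_{i+1}, f_j) for observation i and
    dictionary element j; prior[j] is the prior mass of element j.
    Returns the weights w_j = prior_j * theta_j, which sum to one.
    """
    n = len(losses)
    size = len(prior)
    cumulative = [0.0] * size          # running sums of Q(Z_i, f_j) over i <= m
    averaged = list(prior)             # m = 0 term: theta_{0,j} = 1, i.e. the prior
    for i in range(n):
        for j in range(size):
            cumulative[j] += losses[i][j]
        # Subtract the smallest cumulative loss before exponentiating;
        # the shift cancels in the ratio and avoids numerical underflow.
        shift = min(c for c, p in zip(cumulative, prior) if p > 0)
        unnormalized = [p * math.exp(-(c - shift) / beta)
                        for c, p in zip(cumulative, prior)]
        total = sum(unnormalized)
        for j in range(size):
            averaged[j] += unnormalized[j] / total
    # Average the n + 1 exponential-weight distributions (m = 0, ..., n).
    return [a / (n + 1) for a in averaged]
```

With these weights, the aggregate at a point $x$ is the combination $\sum_j w_j f_j(x)$. The averaging over $m$ is what distinguishes the procedure from a plain exponentially weighted aggregate, which would keep only the last term $m = n$.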

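Theorem 1 (PAC-Bayesian bound in expectation). If Assumptions Q1 and Q2 are fulfilled, then
the MA aggregate $\hat f_n$ satisfies the following oracle inequality

$$E_f[\ell(\hat f_n, f)] \le \inf_{p \in \mathcal{P}_\Lambda(\mathcal{F}_\Lambda)} \Big( \int_\Lambda \ell(f_\lambda, f)\, p(d\lambda) + \frac{\beta\, \mathcal{K}(p, \pi)}{n+1} \Big), \qquad (2)$$

where $\mathcal{K}(p, \pi)$ stands for the Kullback-Leibler divergence

$$\mathcal{K}(p, \pi) = \begin{cases} \int_\Lambda \log\big(\frac{dp}{d\pi}(\lambda)\big)\, p(d\lambda), & \text{if } p \ll \pi, \\ +\infty, & \text{otherwise.} \end{cases}$$

The proof of Theorem 1 is given in the appendix. It is based on a cancellation argument that can be
traced back to Barron [J].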

The oracle inequality of Theorem 1 is in the line of the PAC-Bayesian bounds initiated in Ull
and is applicable to a large variety of models. Some particularly relevant examples will be treated
in Sections 5-7. An interesting feature of Theorem 1 is that it is valid for a large class of prior
distributions.

The fact that (2) holds true for convex mappings $g \mapsto Q(Z, g)$ has been discussed informally in 0],
p. 1606, as a consequence of an oracle inequality for a randomized estimator. A difference of
Theorem 1 from the approach in y] is that the convexity of the loss function is not required.
Remark 1. If the cardinality of $\mathcal{F}_\Lambda$ is finite, say $\mathrm{card}(\mathcal{F}_\Lambda) = N$ and $\Lambda = \{1, \dots, N\}$, inequality (2)
implies that

$$E_f[\ell(\hat f_n, f)] \le \min_{j=1,\dots,N} \Big( \ell(f_j, f) + \frac{\beta \log \pi_j^{-1}}{n+1} \Big),$$

where $\pi_j = \pi(\{j\})$. Oracle inequalities of this type and similar under different sets of assumptions were established
earlier by several authors (cf. [16, 17, 51, 52, H, 9, 3^, 2] and the references therein for closely
related results). Our PAC-Bayesian bound (2) generalizes the oracle inequality of [30, Thm. 3.2]
to an arbitrary, not necessarily finite, family $\mathcal{F}_\Lambda$. In the settings that we study below it is crucial to
consider uncountable $\mathcal{F}_\Lambda$. As we will see later, this generalization allows us to take advantage of
sparsity and suggests a powerful alternative to the classical model selection approach.
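The passage from (2) to the finite-dictionary bound of Remark 1 uses only a point-mass choice of $p$. The short computation below is a sketch of that step; the notation $\delta_j$ for the Dirac mass at $j$ and $\pi_j$ for the prior mass of $j$ is introduced here for illustration.

```latex
\begin{align*}
p = \delta_j:\qquad
\int_\Lambda \ell(f_\lambda,f)\,\delta_j(d\lambda) &= \ell(f_j,f), \\
\mathcal{K}(\delta_j,\pi) &= \log\frac{\delta_j(\{j\})}{\pi(\{j\})} = \log\frac{1}{\pi_j}.
\end{align*}
% Restricting the infimum in (2) to the point masses delta_1, ..., delta_N gives
%   E_f[ell(f_n, f)] <= min_j ( ell(f_j, f) + beta * log(1/pi_j) / (n+1) ).
% For the uniform prior pi_j = 1/N the penalty becomes beta * log(N) / (n+1).
```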

Remark 2. For the regression model with additive noise and deterministic design, PAC-Bayesian
bounds in expectation on the empirical $\ell_2$-norm similar to (2) have been obtained in [20, 21, 22, 23]
for an EWA, which does not contain the step of averaging. Earlier, [39] proved a similar result
for the special case of finite $\mathrm{card}(\mathcal{F}_\Lambda)$ and Gaussian errors. In the notation of the present paper,
the aggregate studied in those works is of the form $\tilde f_n = \int_\Lambda f_\lambda\, \theta_{n,\lambda}\, \pi(d\lambda)$. Interestingly, in a very
recent paper Lecué and Mendelson [37] proved that $\tilde f_n$ does not satisfy inequality (2) in the case
of i.i.d. observations.

Finally, we note that the results of this work hold only for proper priors. However, it is very likely
that Theorem 1 extends to the case of improper priors under some additional assumption ensuring,
for instance, that the integral $\int_\Lambda \exp\{-\frac{1}{\beta}\sum_{i=1}^{m} Q(Z_i, f_w)\}\,\pi(dw)$ appearing in the definition of
the MA estimator is finite.



4. Sparsity oracle inequality 

In this section we introduce a prior $\pi$ that we recommend to use for the MA aggregate under
the sparsity scenario. Then we prove a sparsity oracle inequality (SOI) leading to some natural
choices of the tuning parameters of the prior.

4.1. Sparsity prior and SOI. In what follows we assume that $\Lambda \subseteq \mathbb{R}^M$ for some integer $M \ge 2$.
We will use bold face letters to denote vectors and, in particular, the elements of $\Lambda$. We denote
by $\mathrm{Tr}(A)$ the trace of a square matrix $A$. To deal with integrals of the type $\int_\Lambda \ell(f_\lambda, f)\, p(d\lambda)$ we
introduce the following additional assumption.

Assumption L: For every fixed $f \in \mathcal{F}$ there exists a measurable set $\Lambda_0 \subset \Lambda$ such that $\Lambda \setminus \Lambda_0$ has
zero Lebesgue measure and the mapping $L_f : \Lambda_0 \to \mathbb{R}$, where $L_f(\lambda) = \ell(f_\lambda, f)$, is twice
differentiable. Furthermore, there exists a symmetric $M \times M$ matrix $\mathcal{M}$ such that $\mathcal{M} - \nabla^2 L_f(\lambda)$ is positive
semi-definite for every $\lambda \in \Lambda_0$, where $\nabla^2 L_f(\lambda)$ stands for the Hessian matrix.

We are interested in covering the case of large $M$, possibly much larger than the sample size $n$.
We will be working under the sparsity assumption, i.e., when there exists $\lambda^* \in \mathbb{R}^M$ such that $f$ is
close to $f_{\lambda^*}$ and $\lambda^*$ has a very small number of non-zero components. We argue that an efficient
way for handling this situation is based on a suitable choice of the prior $\pi$. To be more precise,
our results will show how to take advantage of sparsity for the purpose of prediction and not for
accurate estimation of the parameters or selection of the sparsity pattern. Thus, if the underlying
model is sparse, we do not prove that our estimated model is sparse as well, but we claim that
it has a small prediction risk under very mild assumptions. Nevertheless, we have numerical
evidence that our method can also recover the true sparsity pattern very accurately [22], [23]. We
observed this in examples where the restrictive assumptions mentioned in the Introduction are
satisfied.



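Let $\tau$ and $R$ be positive numbers. The sparsity prior is defined by

$$\pi(d\lambda) = \frac{1}{C_{\tau,R}} \Big\{ \prod_{j=1}^{M} (\tau^2 + \lambda_j^2)^{-2} \Big\}\, \mathbb{1}(\|\lambda\|_1 \le R)\, d\lambda, \qquad (3)$$

where $\|\lambda\|_1 = \sum_j |\lambda_j|$ stands for the $\ell_1$-norm, $\mathbb{1}(\cdot)$ denotes the indicator function, and $C_{\tau,R}$ is a
normalizing constant such that $\pi$ is a probability density.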
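As an illustration of definition (3), the unnormalized log-density of the sparsity prior is straightforward to evaluate, which is all that sampling-based approximations of the posterior typically require. The sketch below is an assumption-laden example: the function name `sparsity_prior_log_density` is hypothetical, and the normalizing constant $C_{\tau,R}$ is deliberately omitted.

```python
import math

def sparsity_prior_log_density(lam, tau, R):
    """Unnormalized log-density of the sparsity prior (3) at the vector lam.

    Returns -inf outside the l1-ball of radius R, where the indicator in (3)
    vanishes; the constant log C_{tau,R} is not included.
    """
    if sum(abs(x) for x in lam) > R:
        return -math.inf
    # log of prod_j (tau^2 + lam_j^2)^(-2)
    return -2.0 * sum(math.log(tau * tau + x * x) for x in lam)
```

The polynomial decay of each factor in $\lambda_j$ is what produces the heavy tails: moving one coordinate far from zero costs only a logarithmic penalty in the log-density.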

The prior (3) has a simple heuristic interpretation. Note first that $R$ is a regularization
parameter, which is typically very large. So, in a rough approximation we may consider that the factor
$\mathbb{1}(\|\lambda\|_1 \le R)$ is almost equal to one. Thus, $\pi$ is essentially a product of $M$ rescaled Student's
distributions. Precisely, we deal with the distribution of $\sqrt{2}\,\tau Y$, where $Y$ is a random vector with
i.i.d. coordinates drawn from Student's t-distribution with three degrees of freedom. In the examples below
we choose a very small $\tau$, smaller than $1/n$. Therefore, most of the coordinates of $\tau Y$ are very
close to zero. On the other hand, since Student's distribution has heavy tails, there exists a small
portion of coordinates of $\tau Y$ that are quite far from zero.

The relevance of heavy tailed priors for dealing with sparsity has been emphasized by several
authors (see [46, Section 2.1] and references therein). Most of this work is focused on
logarithmically concave priors, such as the multivariate Laplace distribution. Also, in wavelet estimation
on classes of "sparse" functions, [29] and [44] invoke quasi-Cauchy and Pareto priors respectively.
Bayes estimators with heavy-tailed priors in sparse Gaussian shift models are discussed in 01 .

We are now in a position to state the SOI for the MA aggregate with the sparsity prior. The result is
even more general because it holds not only for the MA aggregate but for any estimator satisfying
(2) with the sparsity prior.

Theorem 2. Let $\hat f_n$ be any estimator satisfying inequality (2), where the loss function $\ell$ satisfies
Assumption L and $\pi$ is the sparsity prior defined as above. Assume that $\Lambda$ contains the set $B_1(R) =
\{\lambda \in \mathbb{R}^M : \|\lambda\|_1 \le R\}$ with $R > 2M\tau$. Then for all $\lambda^*$ such that $\|\lambda^*\|_1 \le R - 2M\tau$ we have

$$E_f[\ell(\hat f_n, f)] \le \ell(f_{\lambda^*}, f) + \frac{4\beta}{n+1} \sum_{j=1}^{M} \log(1 + \tau^{-1}|\lambda_j^*|) + R(M, \tau), \qquad (4)$$

where the residual term is $R(M, \tau) = 4\tau^2\, \mathrm{Tr}(\mathcal{M}) + \frac{\beta}{n+1}$.

The proof of Theorem 2 is deferred to the appendix.

As follows from (4), the main term of the excess risk $E_f[\ell(\hat f_n, f)] - \ell(f_{\lambda^*}, f)$ is proportional to
$\sum_{j=1}^{M} \log(1 + \tau^{-1}|\lambda_j^*|)$. Importantly, the number of nonzero elements in this sum is equal to the
number of nonzero components of $\lambda^*$ that we will further denote by $\|\lambda^*\|_0$. Therefore, for sparse
vectors $\lambda^*$ this term is rather small. But still, in all the examples that we consider below, it
dominates the remainder term $R(M, \tau)$, which is made negligible by choosing a sufficiently small $\tau$,
for instance, $\tau = O((\mathrm{Tr}(\mathcal{M})\, n)^{-1/2})$.

Theorem 2 implies the following bound involving only the $\ell_0$ norm and the upper bound $R$ on
the $\ell_1$ norm of $\lambda^*$.
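To make the roles of the two terms in (4) concrete, the following Python sketch evaluates the main term and the residual term of the bound for a given candidate vector. It is an illustration only: the function name `soi_bound_terms` and its argument names are hypothetical, and the approximation error $\ell(f_{\lambda^*}, f)$ is not computed since it depends on the unknown $f$.

```python
import math

def soi_bound_terms(lam_star, tau, beta, n, trace_M, R):
    """Main and residual terms on the right-hand side of (4).

    lam_star : candidate sparse vector lambda*
    trace_M  : trace of the matrix M from Assumption L
    Returns (main_term, residual_term); raises ValueError when the
    conditions R > 2*M*tau and ||lambda*||_1 <= R - 2*M*tau fail.
    """
    M = len(lam_star)
    if R <= 2 * M * tau:
        raise ValueError("Theorem 2 requires R > 2*M*tau")
    if sum(abs(x) for x in lam_star) > R - 2 * M * tau:
        raise ValueError("Theorem 2 requires ||lambda*||_1 <= R - 2*M*tau")
    # Zero coordinates contribute log(1) = 0, so only the support matters.
    main = 4.0 * beta / (n + 1) * sum(math.log1p(abs(x) / tau) for x in lam_star)
    residual = 4.0 * tau ** 2 * trace_M + beta / (n + 1)
    return main, residual
```

Because the zero coordinates of $\lambda^*$ do not contribute to the sum, the main term scales with the number of nonzero components rather than with the ambient dimension $M$.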
Corollary 1. If some estimator $\hat f_n$ satisfies the oracle inequality of Theorem 2, then

$$E_f[\ell(\hat f_n, f)] \le \ell(f_{\lambda^*}, f) + \frac{4\beta\, \|\lambda^*\|_0}{n+1} \log\Big(1 + \frac{R}{\tau}\Big) + R(M, \tau),$$

where $\lambda^*$ and $R(M, \tau)$ are as in Theorem 2.

Proof. Set $M^* = \|\lambda^*\|_0$ for brevity. Using Jensen's inequality, we get

$$\frac{1}{M^*} \sum_{j=1}^{M} \log(1 + \tau^{-1}|\lambda_j^*|) \le \log\big(1 + (\tau M^*)^{-1} \|\lambda^*\|_1\big).$$

Using the inequalities $\|\lambda^*\|_1 \le R$ and $M^* \ge 1$, the desired inequality follows. □

Note that the sparsity oracle inequalities (SOI) stated in this section are valid not only for the
MA aggregate but for any other estimator (whose definition involves a prior $\pi$) satisfying a
PAC-Bayesian bound similar to (2), possibly with some additional residual terms that should then be
added in the SOI as well. Examples of such estimators can be found in 01.

Remark 3. Assumption L need not be satisfied exactly. In fact, $L_f(\cdot)$ need not even be
differentiable. Inspection of the proof of Theorem 2 reveals that if $L_f(\lambda)$ is well approximated by a
smooth function $\bar L_f(\lambda)$, that is $|L_f(\lambda) - \bar L_f(\lambda)| \le \varepsilon$, $\forall \lambda$, for some small $\varepsilon > 0$ and if $\mathcal{M}_\varepsilon - \nabla^2 \bar L_f$ is
positive semidefinite, then the conclusions of Theorem 2 hold with a modified residual term

$$R_\varepsilon(M, \tau) = \varepsilon + 4\tau^2\, \mathrm{Tr}(\mathcal{M}_\varepsilon) + \frac{\beta}{n+1}.$$

This remark will be useful for studying the problem of classification under the hinge loss where
the function $L_f$ is not differentiable, cf. Section 7.

4.2. Choice of the tuning parameters. The above sparsity oracle inequalities suggest some
guidelines for the choice of tuning parameters $\tau$ and $R$:

(1) Parameter $\tau$ should be chosen very carefully: it should be small enough to guarantee the
negligibility of the residual term but not exponentially small to prevent the explosion of
the main term of the risk. A reasonable choice (which is not the only possible) for $\tau$ is

$$\tau = \min\Big( \frac{1}{\sqrt{\mathrm{Tr}(\mathcal{M})\, n}},\ \frac{R}{4M} \Big). \qquad (5)$$

For this choice of $\tau$ we have:

(a) the residual term $R(M, \tau)$ is at most of order $\beta/n$,

(b) the terms $\log(1 + |\lambda_j^*|/\tau)$ increase at most logarithmically in $M$ and in $n$ under the
condition that $\mathrm{Tr}(\mathcal{M})$ increases not faster than a power of $M$. Note that $\mathrm{Tr}(\mathcal{M}) =
O(M)$ in all the examples that we consider below,

(c) the MA aggregate is accurate enough if there exists a sparse vector $\lambda^*$, with $\ell_1$-norm
bounded by $R/2$, which provides a good approximation $f_{\lambda^*}$ of $f$.

(2) It is clear that one should choose $R$ as large as possible in order to cover the broadest
class of possible values $\lambda^*$. However, we are not aware of any example where Assumption
Q2 holds with finite $\beta$ for $R = +\infty$ or, equivalently, for $\Lambda = \mathbb{R}^M$. Therefore, we assume
that $R$ is an a priori chosen large parameter and interpret the above results as follows: if
there is a sparse vector $\lambda^*$ such that $\ell(f_{\lambda^*}, f)$ is small and $\|\lambda^*\|_1 \le R - 2M\tau$, then the MA
aggregate has a small prediction risk.
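The choice (5) and its effect on the residual term can be checked numerically. The sketch below is only an illustration under the formulas stated above; the helper names `default_tau` and `residual_term` are hypothetical.

```python
import math

def default_tau(trace_M, n, M, R):
    """Tuning parameter tau as in (5): min(1/sqrt(Tr(M) n), R/(4M))."""
    return min(1.0 / math.sqrt(trace_M * n), R / (4.0 * M))

def residual_term(tau, trace_M, beta, n):
    """Residual term R(M, tau) = 4 tau^2 Tr(M) + beta/(n+1) of Theorem 2."""
    return 4.0 * tau ** 2 * trace_M + beta / (n + 1)
```

Since $\tau^2\,\mathrm{Tr}(\mathcal{M}) \le 1/n$ under (5), the residual term is at most $4/n + \beta/(n+1)$, in line with item (a). The second argument of the minimum guarantees $2M\tau \le R/2$, so that the condition $R > 2M\tau$ of Theorem 2 holds and every $\lambda^*$ with $\|\lambda^*\|_1 \le R/2$ is admissible, as in item (c).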
Remark 4. The choice $\tau = \min\big( \frac{1}{\sqrt{\mathrm{Tr}(\mathcal{M})\, n}}, \frac{R}{4M} \big)$ ensures that the estimator $\hat f_n$ is invariant with
respect to an overall scaling of $\lambda$. More precisely, if instead of considering the parametrization
$\{f_\lambda : \|\lambda\|_1 \le R\}$ we consider the parametrization $\{\tilde f_\omega : \|\omega\|_1 \le R/s\}$ with $\tilde f_\omega = f_{s\omega}$ for some $s > 0$,
then the MA aggregate based on the prior defined by (3) remains unchanged. This can be easily
checked by the change of variables using the relation $\tilde{\mathcal{M}} = s^2 \mathcal{M}$, where $\tilde{\mathcal{M}}$ denotes the Hessian
matrix analogous to $\mathcal{M}$ for the dictionary $\{\tilde f_\omega\}$.



Along with choosing the parameters $(\tau, R)$ of the prior, one needs to choose the "temperature"
parameter $\beta$. A model-free choice of $\beta$ seems to be impossible. In fact, even the existence of $\beta$
such that Assumption Q2 holds is not ensured for every model. Some more discussion of the
choice of $\beta$ is given in Remark 7 below.



5. Application to regression with random design 



5.1. Regression estimation in $L_2$-norm. Let $\mathcal{Z} = \mathcal{X} \times \mathbb{R}$ and let $Z_i = (X_i, Y_i)$, $i = 1, \dots, n$, be
i.i.d. observations with $X_i \in \mathcal{X}$ and $Y_i \in \mathbb{R}$. We define the regression function by $f(x) = E(Y_1 | X_1 = x)$,
$\forall x \in \mathcal{X}$, and assume that the errors

$$\xi_i = Y_i - f(X_i), \qquad i = 1, \dots, n,$$

are such that $E[\xi_1^2] < \infty$. Then $E(\xi_i | X_i) = 0$. Let $P_X$ denote the distribution of $X_1$. For $s \in [1, \infty]$
we denote by $\|\cdot\|_{P_X, s}$ the $L_s$-norm with respect to $P_X$. We also denote by $\langle \cdot, \cdot \rangle_{P_X}$ the scalar
product in $L_2(\mathcal{X}, P_X)$. Throughout this section we consider the integrated squared loss $\ell(f, g) =
\|f - g\|_{P_X, 2}^2$. Then it is easy to check that Assumption Q1 is fulfilled with

$$Q(z, g) = (y - g(x))^2, \qquad z = (x, y) \in \mathcal{Z}.$$

Furthermore, we focus on the particular case where $\mathcal{F}_\Lambda$ is a convex subset of the vector space
spanned by a finite number of measurable functions $\{\phi_j\}_{j=1,\dots,M} \subset L_2(\mathcal{X}, P_X)$, that is

$$\mathcal{F}_\Lambda = \Big\{ f_\lambda = \sum_{j=1}^{M} \lambda_j \phi_j \ :\ \lambda \in \mathbb{R}^M \ \text{with}\ \|\lambda\|_1 \le R \Big\} \qquad (6)$$

for some $R > 0$. Then Assumption L holds with $\mathcal{M}$ being the matrix with entries $\langle \phi_j, \phi_{j'} \rangle_{P_X}$, which
will be referred to as the Gram matrix. This definition of $\mathcal{M}$ will be used throughout this section.
The collection of functions $\{\phi_1, \dots, \phi_M\}$ will be called the dictionary.

Remark 5. The value of the parameter $\tau$ presented in (5) does not allow us to take into account
the possible inhomogeneity of functions $\phi_j$. One way of dealing with the inhomogeneity is to let
$\tau$ depend on $j$ in the definition of the sparsity prior $\pi$. In this paper we consider for brevity a
less general approach, which is common in the literature on sparsity. Namely, we normalize the
functions $\phi_j$ in advance and use the same $\tau$ for all coordinates of $\lambda$. The normalization is done
by rescaling the functions $\phi_j$ so that all the diagonal entries of the Gram matrix $\mathcal{M}$ are equal to
one.

Following this remark, we assume that the functions $\phi_j$ are such that $\|\phi_j\|_{P_X, 2} = 1$ for every $j$.
Therefore, $\mathrm{Tr}(\mathcal{M}) = M$.

Proposition 1. Assume that for some constant $L_\phi > 0$ we have $\max_{j=1,\dots,M} \|\phi_j\|_{P_X, \infty} \le L_\phi$. If, in
addition, the errors $\xi_i$ have a bounded exponential moment:

$$\exists\, b, \sigma^2 > 0 \ \text{such that}\ E(e^{t\xi_1} | X_1) \le e^{\sigma^2 t^2/2}, \quad \forall |t| \le b,\ P_X\text{-a.s.}, \qquad (7)$$

then, for every $\beta \ge \max\big(2\sigma^2 + 2\sup_{\lambda \in \Lambda} \|f_\lambda - f\|_{P_X, \infty}^2,\ 4RL_\phi/b\big)$, the MA aggregate $\hat f_n$ defined by (1)
with the sparsity prior (3) satisfies

$$E_f\big[\|\hat f_n - f\|_{P_X, 2}^2\big] \le \inf_{\lambda^*} \Big\{ \|f_{\lambda^*} - f\|_{P_X, 2}^2 + \frac{4\beta}{n+1} \sum_{j=1}^{M} \log(1 + \tau^{-1}|\lambda_j^*|) \Big\} + R(M, \tau), \qquad (8)$$

where the inf is taken over all $\lambda^*$ such that $\|\lambda^*\|_1 \le R - 2M\tau$ and $R(M, \tau) = 4\tau^2 M + \frac{\beta}{n+1}$.

Proof of Proposition 1. In view of Theorem 2 it suffices to check that Assumption Q2 is fulfilled
for $\beta \ge \max\big(2\sigma^2 + 2\sup_{\lambda \in \Lambda} \|f_\lambda - f\|_{P_X, \infty}^2,\ 4RL_\phi/b\big)$. This is done along the lines of the proof of
[30, Corollary 5.5]. We omit the details. □



Proposition 1 can be used in signal denoising under the sparsity assumption. A typical issue
studied in the statistical literature, as well as in the literature on signal processing, is to estimate a
signal $f$ based on its noisy version recorded at some points $X_1, \dots, X_n$, under the assumption
that $f$ admits a sparse representation w.r.t. some given dictionary $\{\phi_j;\ j = 1, \dots, M\}$. By sparse
representation we mean a linear combination of a small number of functions $\phi_j$. Assume for
the moment that the noise satisfies (7) with $b = +\infty$ and some known $\sigma \in [0, \infty)$ and that the
unknown signal is bounded by some constant that can be assumed to be equal to 1. The latter
assumption is fulfilled in many applications, as for example in image processing.

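The method that we suggest for estimating a sparse representation of $f$, under the assumption
$M \ge n$, consists of:

a) normalizing the functions $\phi_j$,

b) fixing a parameter $R > 0$,

c) setting

$$\beta = 2\sigma^2 + 2(RL_\phi + 1)^2, \qquad \tau = \min\Big( \frac{1}{\sqrt{Mn}},\ \frac{R}{4M} \Big), \qquad (9)$$

d) computing the MA aggregate $\hat f_n = \sum_{j=1}^{M} \hat\lambda_j \phi_j$ with coefficients $\hat\lambda_j = \int_{\mathbb{R}^M} \lambda_j\, \theta_\lambda\, \pi(d\lambda)$ based
on the sparsity prior (3) and the posterior density

$$\theta_\lambda = \frac{1}{n+1} \sum_{m=0}^{n} \frac{\exp\{-\frac{1}{\beta}\sum_{i=1}^{m} (Y_i - f_\lambda(X_i))^2\}}{\int_\Lambda \exp\{-\frac{1}{\beta}\sum_{i=1}^{m} (Y_i - f_w(X_i))^2\}\, \pi(dw)}.$$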

In view of Proposition 1, if we run this procedure with some value $R > 0$, we will get accurate
estimates for signals that are well approximated by a sparse linear combination of functions $\phi_j$,
provided that the coefficients of this linear combination have an $\ell_1$-norm bounded by $R - 2M\tau$.
In most of the problems arising in signal or image processing the $\ell_1$-norm of the best sparse
approximation to the signal is unknown. It is therefore important to make a data-driven choice
of $R$. Let us outline one possible way to do this. Consider that only the signals formed by a
linear combination of at most $M^*$ functions $\phi_j$ are of interest, and assume that the dictionary
$\{\phi_j\}$ satisfies the restricted isometry property (RIP) of order $M^*$, see equation (1.3) in [15] for the
definition. In other terms, assume that / « f\* with ||A*||o ^ M* and ||/\*||p x> 2 ^ pA*||2 where 
|| • II 2 is the Euclidean norm. Then we can bound the ^i-norm of A* as follows: 



\Px,2- 
We can estimate $\|f\|_{P_X, 2}^2$ consistently by $\frac{1}{n}\sum_{i=1}^{n} (Y_i^2 - \sigma^2)$. Based on these estimates, we suggest
the following data-driven choice of $R$:

^ rM* " ?1 i/2 

where $x_+ = \max(x, 0)$ and $M^*$ is an a priori approximation of the sparsity index of the signal $f$.

Remark 6. The choice of $\beta$ in (9) requires the knowledge of $\sigma^2$, which characterizes the
magnitude of the noise. This value may not be available in practice. Then it is natural to consider $\beta$ as
a tuning parameter and to select it by a data-driven method, for example, by a suitably adapted
version of cross-validation. This point deserves special attention and is beyond the scope of
the present paper.

Remark 7. If the distribution $P_X$ of the design is unknown, it is impossible to normalize the
dictionary functions $\phi_j$. In such a situation, i.e., when the functions $\phi_j$ do not necessarily satisfy
$\|\phi_j\|_{P_X, 2} = 1$, the claim of Proposition 1 continues to hold true with the modified residual term
$R(M, \tau) = 4\tau^2\, \mathrm{Tr}(\mathcal{M}) + \frac{\beta}{n+1}$, which can be bounded by $4\tau^2 M L_\phi^2 + \frac{\beta}{n+1}$. Thus, once again, choosing
$\tau$ as in (9) makes the residual term $R(M, \tau)$ negligible w.r.t. the main terms of the risk bound.

Remark 8. Proposition 1 is in agreement with the main principles of the theory of compressive
sampling and sparse recovery, cf., e.g., [14]. Indeed, if the tuning parameters are well-chosen,
the prediction done by $\hat f_n$ can be quite accurate even if the sample size is relatively small with
respect to the dimension $M$. This happens if the signal admits an $M^*$-sparse representation in a
possibly overcomplete dictionary of cardinality $M$. Then the number of observations sufficient
for an accurate prediction is of order $M^*$ up to a logarithmic factor. Proposition 1 is also in
perfect agreement with the principle of incoherent sampling (see, for instance, [14], page 10). In
fact, in our setting, the incoherence of the sampling is ensured by the fact that $\phi_j \in L_2(\mathcal{X}, P_X)$
satisfy $\|\phi_j\|_{P_X, 2} = 1$.

Before closing this section, let us mention the recent work HH , where some interesting results
on the aggregation of estimators in sparse regression are obtained.



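5.2. Linear regression with random design. Consider now the case of linear regression. Assume
that the i.i.d. observations $(X_i, Y_i)$, $i = 1, \dots, n$, are drawn from the linear model

$$Y_i = X_i^\top \lambda^* + \xi_i, \qquad i = 1, \dots, n, \qquad (10)$$

where $X_i \in \mathbb{R}^M$ are i.i.d. covariates and $\lambda^* \in \mathbb{R}^M$ is the parameter of interest. Then our method
reduces to estimating $\lambda^*$ by

$$\hat\lambda_n = \frac{1}{n+1} \sum_{m=0}^{n} \int_{\mathbb{R}^M} \lambda\, \theta_{m,\lambda}\, \pi(d\lambda),$$

where $\pi$ is the sparsity prior and

$$\theta_{m,\lambda} = \frac{\exp\{-\frac{1}{\beta}\sum_{i=1}^{m} (Y_i - X_i^\top \lambda)^2\}}{\int_{\mathbb{R}^M} \exp\{-\frac{1}{\beta}\sum_{i=1}^{m} (Y_i - X_i^\top \omega)^2\}\, \pi(d\omega)}.$$

Then the following result holds.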

Proposition 2. Consider the linear model (10) satisfying the above assumptions. Let the support
of the probability distribution of $X_1$ be included in $[-1, 1]^M$ and $E[e^{t\xi_1} | X_1] \le e^{\sigma^2 t^2/2}$ for all $t \in \mathbb{R}$.
Set $\Sigma_X = E[X_1 X_1^\top]$. Then for any $\beta \ge 2\sigma^2 + 2(R + \|\lambda^*\|_1)^2$ and any $\lambda^*$ such that $\|\lambda^*\|_1 \le R - 2M\tau$ we
have

$$E\big[\|\Sigma_X^{1/2}(\hat\lambda_n - \lambda^*)\|_2^2\big] \le \frac{\beta}{n+1} \Big[ 1 + 4 \sum_{j=1}^{M} \log(1 + \tau^{-1}|\lambda_j^*|) \Big] + 4\tau^2\, \mathrm{Tr}(\Sigma_X). \qquad (11)$$

This proposition follows directly from Proposition 1 by setting $\phi_j(x) = x_j$ if $|x_j| \le 1$ and $\phi_j(x) = 0$
if $|x_j| > 1$, where $x \in \mathbb{R}^M$ and $x_j$ is its $j$th coordinate. Note also that here we have $\mathcal{M} = \Sigma_X$.

5.3. Rate optimality. In this section, we discuss the optimality of the rates of aggregation
obtained in Proposition 1. We show that the MA aggregate with the sparsity prior attains, up to
a logarithmic factor, the optimal rates of aggregation (cf. [47]). Furthermore, $\hat f_n$ is adaptive in
the sense that it simultaneously achieves the optimal rates for the Model Selection (MS), Convex
(C) and Linear (L) aggregation. In what follows, these rates are denoted respectively by $\psi_n^{MS}(M)$,
$\psi_n^{C}(M)$ and $\psi_n^{L}(M)$. It is established in Ha] that:

$$\psi_n^{MS}(M) = n^{-1} \log M,$$
$$\psi_n^{C}(M) = n^{-1} (M \wedge \sqrt{n}) \log(1 + M n^{-1/2}),$$
$$\psi_n^{L}(M) = n^{-1} M.$$

We wish to compare the risk of the estimator /„ with the sparsity prior jt to the smallest error 
II A* ~f\\p x 2 where A* is one of A MS , A c or A L such that 

' c=ar8 iSi iA " /i2 ^ 

AL = arg A^M ll/A " /l12 ^ 2 - 

In the next proposition we denote by c constants which do not depend on M and n. 
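To get a feeling for how the three benchmark rates compare, the following minimal sketch evaluates them numerically. It assumes the rates read $\psi_n^{MS}(M) = n^{-1}\log M$, $\psi_n^{C}(M) = n^{-1}(M \wedge \sqrt{n})\log(1 + Mn^{-1/2})$ and $\psi_n^{L}(M) = n^{-1}M$; the function names are hypothetical and serve only for illustration.

```python
import math


def rate_model_selection(M, n):
    """Model selection rate: log(M) / n."""
    return math.log(M) / n


def rate_convex(M, n):
    """Convex aggregation rate: min(M, sqrt(n)) * log(1 + M / sqrt(n)) / n."""
    return min(M, math.sqrt(n)) * math.log(1.0 + M / math.sqrt(n)) / n


def rate_linear(M, n):
    """Linear aggregation rate: M / n."""
    return M / n


if __name__ == "__main__":
    n = 100
    for M in (4, 100, 10_000):
        print(M, rate_model_selection(M, n), rate_convex(M, n), rate_linear(M, n))
```

For a dictionary much larger than $\sqrt{n}$ the convex rate stays of order $n^{-1/2}$ up to a logarithm, whereas the linear rate grows proportionally to $M$.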

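Proposition 3. Assume that $\hat f_n$ satisfies (8) with some $\beta > 0$ independent of $M$ and $n$, and that $\log(M) \le c_0 n$ for some constant $c_0$. If $R > 4$ and $\tau$ satisfies (9) with $\mathrm{Tr}(\mathcal{M}) = M$, then

$$E_f\big[\|\hat f_n - f\|_{P_X,2}^2\big] \le \|f_{\lambda^{MS}} - f\|_{P_X,2}^2 + c\,\psi_n^{MS}(M)\log(1 + nM)$$

and

$$E_f\big[\|\hat f_n - f\|_{P_X,2}^2\big] \le \|f_{\lambda^{C}} - f\|_{P_X,2}^2 + c\,\psi_n^{C}(M)\log(1 + n\sqrt{M}).$$

Finally, if $\|\lambda^{L}\|_1 \le R - 2M\tau$, then

$$E_f\big[\|\hat f_n - f\|_{P_X,2}^2\big] \le \|f_{\lambda^{L}} - f\|_{P_X,2}^2 + c\,\psi_n^{L}(M)\log(1 + nM).$$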

Proof. For model selection and linear aggregation the result follows immediately from (8) by putting there $\lambda^* = \lambda^{MS}$ or $\lambda^* = \lambda^{L}$ and using that $\|\lambda^{MS}\|_0 = \|\lambda^{MS}\|_1 = 1$. The case of convex aggregation with $M \le \sqrt{n}$ follows from the bound for the linear aggregation. The case $M > \sqrt{n}$ requires some additional arguments, which are presented below.

Let $s = s_n$ be the integer part of $\sqrt{n}$, denoted by $\lfloor\sqrt{n}\rfloor$. We assume that $\lambda^{C}$ has at least $s_n$ non-zero coordinates, the case $\|\lambda^{C}\|_0 < \lfloor\sqrt{n}\rfloor$ being a trivial consequence of (8). Using the Maurey randomization argument as in [?, ?], one can show that

$$\min_{\|\lambda\|_0 \le s}\|f_\lambda - f\|_{P_X,2}^2 \le \|f_{\lambda^{C}} - f\|_{P_X,2}^2 + \frac{1}{s}. \tag{12}$$

Let $\lambda^{s,C}$ be a point where the minimum on the left hand side of (12) is attained. Since $\lambda^{s,C}$ has not more than $s$ nonzero coordinates and $\|\lambda^{s,C}\|_1 \le 1$, we have $\sum_j \log(1 + |\lambda_j^{s,C}|/\tau) \le s\log(1 + \|\lambda^{s,C}\|_1/\tau) \le s\log(1 + \tau^{-1})$. Thus, applying (8) to $\lambda^* = \lambda^{s,C}$ and using (9), we get

$$E_f\big[\|\hat f_n - f\|_{P_X,2}^2\big] \le \|f_{\lambda^{s,C}} - f\|_{P_X,2}^2 + \frac{c\,s\log(1 + \tau^{-1})}{n}, \tag{13}$$

where $c$ is some constant independent of $n$ and $M$. Recall now that $\|f_{\lambda^{s,C}} - f\|_{P_X,2}^2$ is equal to the left hand side of (12). This implies

$$E_f\big[\|\hat f_n - f\|_{P_X,2}^2\big] \le \|f_{\lambda^{C}} - f\|_{P_X,2}^2 + \frac{1}{s} + \frac{c\,s\log(1 + \tau^{-1})}{n},$$

which leads to the desired result due to the choice $s = \lfloor\sqrt{n}\rfloor$ and (9). □

Remark 9. The theory developed here relies on the fact that the risk is measured by the expected squared loss. In the case of general $L_p$-loss with $p > 1$, a universal procedure for aggregation is proposed in [27] and it is proved that the aggregation in $L_p$ for $p > 2$ is more difficult than it is in $L_2$.



6. Application to density estimation 



Let $X_1,\dots,X_n$ be the observations, which are independent copies of a random variable $X : \Omega \to \mathcal{X}$ whose distribution has a density $f$ with respect to some reference measure $\mu$. We consider the problem of estimating $f$ based on $X_1,\dots,X_n$. We measure the risk of an estimator $\hat f$ of $f$ by the integrated squared error

$$\ell(\hat f, f) = \|\hat f - f\|_{\mu,2}^2 = \int_{\mathcal{X}}\big(\hat f(x) - f(x)\big)^2\,\mu(dx).$$

Define the mapping $Q(\cdot,\cdot) : \mathcal{X} \times L^2(\mathcal{X},\mu) \to \mathbb{R}$ by

$$Q(x,g) = \|g\|_{\mu,2}^2 - 2g(x).$$

It is straightforward that $E_f Q(X,g) - \ell(g,f) = -\|f\|_{\mu,2}^2$ and, therefore, Assumption Q1 is fulfilled. To further specify the setting, we consider the family $\mathcal{F}_\Lambda$ defined in (6) where the functions $\phi_j$ are chosen from $L^2(\mathcal{X},\mu)$ so that $\|\phi_j\|_{\mu,2} = 1$ and $\|\phi_j\|_{\mu,\infty} \le L$, $j = 1,\dots,M$, for some positive constant $L$. Note that the functions $\phi_j$ need not be integrable or positive. We have the following result.

Proposition 4. Let the assumptions given above in this section be satisfied and $\|f\|_{\mu,\infty} \le L$. If $\beta$ is such that

$$(\beta - 2R^2)\,e^{-4R(L+\sqrt{L})/\beta} \ge 2L + 4RL, \tag{14}$$

then the MA aggregate $\hat f_n$ based on the sparsity prior satisfies

$$E_f\big[\|\hat f_n - f\|_{\mu,2}^2\big] \le \inf_{\lambda^*}\Big\{\|f_{\lambda^*} - f\|_{\mu,2}^2 + \frac{4\beta}{n+1}\sum_{j=1}^{M}\log(1 + \tau^{-1}|\lambda_j^*|)\Big\} + R(M,\tau), \tag{15}$$

where the inf is taken over all the vectors $\lambda^*$ such that $\|\lambda^*\|_1 \le R - 2M\tau$.

The proof of this proposition is given in the appendix. It consists in checking that Assumptions Q2 and L are satisfied and then applying Theorem 2. Condition (14) can be significantly simplified in many concrete situations. For example, if we assume that $R = 1$ or $R = 2$, then one can choose $\beta = 12L$ and $\beta = 23L$ respectively, provided that $L \ge 2$.
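The simplified choices of $\beta$ can be checked numerically. The sketch below assumes that condition (14) reads $(\beta - 2R^2)e^{-4R(L+\sqrt{L})/\beta} \ge 2L + 4RL$, which is the form obtained when following the appendix proof step by step; the helper name `condition_14_holds` is hypothetical.

```python
import math


def condition_14_holds(beta, R, L):
    """Check (beta - 2 R^2) * exp(-4 R (L + sqrt(L)) / beta) >= 2 L + 4 R L.

    The exact form of the exponent is an assumption reconstructed from the
    proof of Proposition 4, not a quotation of the original display.
    """
    lhs = (beta - 2.0 * R ** 2) * math.exp(-4.0 * R * (L + math.sqrt(L)) / beta)
    return lhs >= 2.0 * L + 4.0 * R * L


if __name__ == "__main__":
    for L in (2, 3, 5, 10, 100):
        print(L, condition_14_holds(12 * L, 1, L), condition_14_holds(23 * L, 2, L))
```

Under this reading the constants 12 and 23 are tight at $L = 2$: replacing them by 11 and 22 makes the check fail there.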

7. Classification 

Assume that we have a sample $(X_1,Y_1),\dots,(X_n,Y_n)$, where $X_i \in \mathcal{X}$ and $Y_i \in \{-1,+1\}$ are labels. Here $\mathcal{X}$ is an arbitrary measurable space and $(X_i,Y_i)$ are assumed to be generated independently according to a probability distribution $P$. The goal of binary classification is to assign a label $+1$ or $-1$ to a new random point $X$ which is distributed as $X_1$ and independent of $X_1,\dots,X_n$.

The problem of interest in classification is to design a classifier $\tilde f : \mathcal{X} \to \mathbb{R}$ having a small misclassification risk $R[\tilde f] = \int_{\mathcal{X}\times\{\pm 1\}} \mathbb{1}(\mathrm{sgn}(\tilde f(x)) \ne y)\,P(dx,dy)$. Denote by $\eta : \mathcal{X} \to [-1,1]$ the regression function

$$\eta(x) = E(Y_1 \mid X_1 = x) = 2P(Y_1 = 1 \mid X_1 = x) - 1, \qquad \forall x \in \mathcal{X}.$$

The Bayes classifier is defined as follows: $f(x) = \mathbb{1}(\eta(x) > 0) - \mathbb{1}(\eta(x) \le 0) = \mathrm{sgn}(\eta(x))$. One easily checks that

$$R[\tilde f] - R[f] = \int_{\mathcal{X}} \mathbb{1}\big(\mathrm{sgn}(\tilde f(x)) \ne f(x)\big)\,|\eta(x)|\,P_X(dx),$$

where $P_X$ is the distribution of $X_1$. This shows that the Bayes classifier $f$ minimizes the misclassification risk. Clearly, the Bayes classifier is not available in practice because of its dependence on the unknown regression function $\eta(\cdot)$.
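The excess-risk identity above can be verified on a toy example. The following sketch uses a hypothetical three-point feature space with made-up values of $P_X$ and $\eta$; it computes the excess misclassification risk once directly and once through the formula $\int \mathbb{1}(\mathrm{sgn}(\tilde f) \ne f)\,|\eta|\,dP_X$, with the convention $\mathrm{sgn}(0) = -1$ used in the definition of the Bayes classifier.

```python
def sign(value):
    """Sign with the convention sgn(0) = -1, as in the Bayes classifier."""
    return 1 if value > 0 else -1


def misclassification_risk(classifier, px, eta):
    """R[f] = sum_x P_X(x) * P(Y != sgn(f(x)) | X = x) on a finite space."""
    risk = 0.0
    for x, prob in px.items():
        p_plus = (1.0 + eta[x]) / 2.0  # P(Y = +1 | X = x)
        risk += prob * ((1.0 - p_plus) if sign(classifier[x]) == 1 else p_plus)
    return risk


def excess_risk_formula(classifier, px, eta):
    """Right-hand side of the identity: mass of disagreement weighted by |eta|."""
    return sum(
        prob * abs(eta[x])
        for x, prob in px.items()
        if sign(classifier[x]) != sign(eta[x])
    )


if __name__ == "__main__":
    px = {"a": 0.5, "b": 0.3, "c": 0.2}      # illustrative marginal P_X
    eta = {"a": 0.8, "b": -0.4, "c": 0.0}    # illustrative regression function
    clf = {"a": -1.0, "b": -2.0, "c": 0.5}   # some real-valued classifier
    direct = misclassification_risk(clf, px, eta) - misclassification_risk(eta, px, eta)
    print(direct, excess_risk_formula(clf, px, eta))
```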

This problem is a special case of the general setting of Section 2 if we take there $Z_i = (X_i,Y_i)$ and $\ell(g,f) = R[g] - R[f]$. Assumption Q1 is then fulfilled with $Q(z,g) = \mathbb{1}(\mathrm{sgn}(g(x)) \ne y)$ where $z = (x,y)$. However, Assumptions Q2 and L are not satisfied.

7.1. Classification under smooth $\Phi$-losses. An alternative approach is to consider the $\Phi$-risk of classifiers. For a fixed convex twice differentiable function $\Phi : \mathbb{R} \to \mathbb{R}_+$, the $\Phi$-risk of a classifier $\tilde f$ is defined by

$$R_\Phi[\tilde f] = \int_{\mathcal{X}\times\{\pm 1\}} \Phi(-y\tilde f(x))\,P(dx,dy) = \frac{1}{2}\int_{\mathcal{X}}\big\{\Phi(-\tilde f(x))(1 + \eta(x)) + \Phi(\tilde f(x))(1 - \eta(x))\big\}\,P_X(dx).$$

In this subsection, we are mainly interested in the four common choices of $\Phi$ presented in the top lines of Table 1. For these and other loss functions, sharp relations between the $\Phi$-risk and the misclassification risk of a given classifier have been established in [55], [5]. In particular, it is proved in these papers that the minimum of the $\Phi$-risk is attained at any classifier satisfying

$$f_\Phi(x) \in \arg\min_{u \in \mathbb{R}}\big\{\Phi(-u)(1 + \eta(x)) + \Phi(u)(1 - \eta(x))\big\}, \qquad \forall x \in \mathcal{X}.$$

Note however that in practice the computation of $f_\Phi$ is impossible because of its dependence on the unknown $\eta$.

Our aim here is to design a classifier having a $\Phi$-risk which is nearly as small as the minimal possible $\Phi$-risk. This task can be recast as a problem of estimation where $f_\Phi$ is the function to be estimated and the quality of an estimator (classifier) $\tilde f$ is measured by the excess risk $R_\Phi[\tilde f] - R_\Phi[f_\Phi]$. Therefore, this is a particular case of the setting described in Section 2 with $\ell(g,f) = \ell_\Phi(g,f_\Phi) = R_\Phi[g] - R_\Phi[f_\Phi]$ and $Q(z,g) = \Phi(-yg(x))$ for every $z = (x,y)$. Here Assumption Q1 is obviously satisfied.
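| Loss | $\Phi(u)$ | $f_\Phi(x)$ | $Q(z,g)$ | $\beta_\Phi$ | $C_\Phi$ |
|---|---|---|---|---|---|
| Squared | $(1+u)^2$ | $\eta(x)$ | $(1 - yg(x))^2$ | $2(1 + RL_\Phi)^2$ | $8$ |
| Truncated Squared | $\{\max(1+u,0)\}^2$ | $\eta(x)$ | $\{\max(1 - yg(x),0)\}^2$ | $2(1 + RL_\Phi)^2$ | $8$ |
| Boosting | $e^{u}$ | $\frac{1}{2}\log\frac{1+\eta(x)}{1-\eta(x)}$ | $e^{-yg(x)}$ | $e^{RL_\Phi}$ | $4e^{RL_\Phi}$ |
| Logit-Boosting | $\log(1 + e^{u})$ | $\log\frac{1+\eta(x)}{1-\eta(x)}$ | $\log(1 + e^{-yg(x)})$ | $e^{RL_\Phi}$ | $1$ |
| Misclassification | $\mathbb{1}(u \ge 0)$ | | $\mathbb{1}(g(x) \ne y)$ | | |
| Hinge | $\max(1+u,0)$ | | $\max(1 - yg(x),0)$ | | |

TABLE 1. Common choices of function $\Phi$; classifiers minimizing the $\Phi$-risk; the corresponding functions $Q$; constants $\beta_\Phi$ and $C_\Phi$ appearing in Proposition 5.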

In the same spirit as in the previous sections, we assume that we are given a dictionary $\{\phi_j\}_{j=1,\dots,M}$ of functions on $\mathcal{X}$ with values in $\mathbb{R}$. The family $\mathcal{F}_\Lambda$ is defined as the set of all linear combinations of the functions $\phi_j$ with coefficients $\lambda_1,\dots,\lambda_M$, such that the vector $\lambda = (\lambda_1,\dots,\lambda_M)$ belongs to the $\ell_1$ ball with radius $R$, cf. (6). The next proposition shows that a strong sparsity oracle inequality holds for an appropriate choice of $\beta$.

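Proposition 5. Assume that for some constant $L_\Phi > 0$ we have $\max_{j=1,\dots,M}\|\phi_j\|_{P_X,\infty} \le L_\Phi$. Let the function $\Phi$ be twice continuously differentiable with$^{1}$

$$\beta_\Phi := \sup_{|u| \le RL_\Phi}\frac{\Phi'(u)^2}{\Phi''(u)} < \infty.$$

Then the MA aggregate defined with $\beta \ge \beta_\Phi$ and with the sparsity prior satisfies

$$E_f\big[\ell_\Phi(\hat f_n, f)\big] \le \min_{\|\lambda^*\|_1 \le R - 2M\tau}\Big(\ell_\Phi(f_{\lambda^*}, f) + \frac{4\beta}{n+1}\sum_{j=1}^{M}\log(1 + \tau^{-1}|\lambda_j^*|)\Big) + C_\Phi\tau^2\sum_{j=1}^{M}\|\phi_j\|_{P_X,2}^2 + \frac{\beta}{n+1}, \tag{16}$$

where $C_\Phi = 4\max_{|u| \le RL_\Phi}\Phi''(u)$.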

Proof. We apply Theorems 1 and 2. First, we show that Assumption Q2 is satisfied. Recall that $Q(z,g) = \Phi(-yg(x))$ and set $\Psi_\beta(g,\tilde g) = \int_{\mathcal{X}\times\{\pm 1\}}\exp\big[-\beta^{-1}\{Q(z,g) - Q(z,\tilde g)\}\big]\,P(dx,dy)$. Let us show that for $\beta \ge \beta_\Phi$ the mapping $g \mapsto \Psi_\beta(g,\tilde g)$ is concave. By standard arguments, this reduces to proving that the function $t \mapsto \psi(t) = \Psi_\beta(tg + (1-t)\bar g, \tilde g)$ is concave on $[0,1]$ for every fixed $g$, $\bar g$ and $\tilde g$. A simple algebra shows that the second derivative of $\psi$ is non-positive on $[0,1]$ whenever $\beta \ge \Phi'(-yg(x))^2/\Phi''(-yg(x))$ for all $(x,y) \in \mathcal{X}\times\{\pm 1\}$ and all $g \in \mathcal{F}_\Lambda$. On this set of $x, y, g$ the value $-yg(x)$ belongs to the interval $[-RL_\Phi, RL_\Phi]$. Thus, Assumption Q2 is satisfied for $\beta \ge \beta_\Phi$ and Theorem 1 can be applied.

To use Theorem 2 it remains to prove that Assumption L is satisfied with $\mathcal{M}$ being the matrix with entries $\frac{1}{4}C_\Phi\langle\phi_j,\phi_l\rangle$, where $j$ and $l$ run over $\{1,\dots,M\}$. From the formula for $R_\Phi[f]$ given at the beginning of this subsection we get

$$\nabla^2 L_f(\lambda) = \nabla^2_\lambda R_\Phi[f_\lambda] = \int\big(\nabla f_\lambda(x)\,\nabla f_\lambda(x)^\top\big)\,\Phi''(-yf_\lambda(x))\,P(dx,dy).$$

Since $yf_\lambda(x) \in [-RL_\Phi, RL_\Phi]$ the matrix $\mathcal{M} - \nabla^2 L_f(\lambda)$, where $\mathcal{M} = \frac{1}{4}C_\Phi\int(\nabla f_\lambda\nabla f_\lambda^\top)(x)\,P_X(dx)$, is positive semi-definite. The desired result follows now from the linearity in $\lambda$ of $f_\lambda(x)$. □



We use here the convention 0/0 = 0. 
For the four common choices of $\Phi$ presented in the top lines of Table 1 all the conditions of Proposition 5 are satisfied for a properly chosen constant $\beta$. The minimal values of $\beta$, as well as the values of the constant $C_\Phi$, for each loss function $\Phi$ are reported in the last two columns of Table 1. It is often interesting to use binary classifiers $\phi_j$ (i.e., functions with values in $\{\pm 1\}$), in which case $L_\Phi = 1$. Also note that the expressions for $\beta_\Phi$ suggest choosing $R$ not too large, especially in the case of the boosting and the logit-boosting losses.
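The constants of Proposition 5 can be evaluated numerically from their definitions, $\beta_\Phi = \sup_{|u| \le RL_\Phi}\Phi'(u)^2/\Phi''(u)$ (with the convention $0/0 = 0$) and $C_\Phi = 4\max_{|u| \le RL_\Phi}\Phi''(u)$. The sketch below does this on a grid for the four smooth losses; the dictionary `LOSS_DERIVATIVES`, the function name and the grid size are illustrative assumptions, and the derivatives are hand-computed rather than taken from the paper.

```python
import math

# First and second derivatives of each loss Phi, computed by hand.
LOSS_DERIVATIVES = {
    "squared": (lambda u: 2.0 * (1.0 + u), lambda u: 2.0),
    "truncated_squared": (
        lambda u: 2.0 * max(1.0 + u, 0.0),
        lambda u: 2.0 if u > -1.0 else 0.0,
    ),
    "boosting": (lambda u: math.exp(u), lambda u: math.exp(u)),
    "logit_boosting": (
        lambda u: 1.0 / (1.0 + math.exp(-u)),
        lambda u: math.exp(u) / (1.0 + math.exp(u)) ** 2,
    ),
}


def loss_constants(first, second, radius, grid=2001):
    """Grid approximation of (beta_Phi, C_Phi) on the interval [-radius, radius]."""
    beta_phi, max_second = 0.0, 0.0
    for k in range(grid):
        u = -radius + 2.0 * radius * k / (grid - 1)
        d1, d2 = first(u), second(u)
        if d2 == 0.0:
            ratio = 0.0 if d1 == 0.0 else math.inf  # convention 0/0 = 0
        else:
            ratio = d1 * d1 / d2
        beta_phi = max(beta_phi, ratio)
        max_second = max(max_second, d2)
    return beta_phi, 4.0 * max_second


if __name__ == "__main__":
    radius = 1.5  # plays the role of R * L_Phi
    for name, (first, second) in LOSS_DERIVATIVES.items():
        print(name, loss_constants(first, second, radius))
```

For the two exponential-type losses $\beta_\Phi$ grows like $e^{RL_\Phi}$, which illustrates why a moderate $R$ is advisable there.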

7.2. Classification under the hinge loss. One of the key issues in machine learning is classification by support vector machines. They correspond to a penalized $\Phi$-risk classification with the loss $\Phi(u) = \Phi_H(u) = \max(1+u,0)$, referred to as the hinge loss. A notable feature of the hinge loss is that the classifier $f_{\Phi_H}(x)$ equals $\mathrm{sgn}(\eta(x))$ and therefore coincides with the Bayes classifier for the misclassification risk. However, since the hinge loss does not satisfy Assumptions Q2 and L, Proposition 5 cannot be applied. Furthermore, as shown in [?], no aggregation procedure can attain the fast rate of aggregation (i.e., the rate $1/n$ up to a logarithmic factor) when the risk is measured by the hinge loss.

The reason for the failure of Assumption L is that the hinge loss is not continuously differentiable. One can circumvent this problem by using the smoothing argument of Remark 3. Indeed, let us fix $\varepsilon > 0$ and introduce the function $K_\varepsilon(z) = (\sqrt{\varepsilon^2 + z^2} - \varepsilon)\,\mathbb{1}(z > 0)$, which is a smooth approximation to the positive part of $z$. It is easy to see that $K_\varepsilon(z) \le \max(z,0) \le K_\varepsilon(z) + \varepsilon$ and that $K_\varepsilon''(z) = \varepsilon^2(\varepsilon^2 + z^2)^{-3/2} \in (0, \varepsilon^{-1}]$ for $z > 0$. This allows us to approximate the loss $\ell_{\Phi_H}(g,f)$ by

$$\ell_\varepsilon(g,f) = \frac{1}{2}\int_{\mathcal{X}}\big\{K_\varepsilon(1 - g(x))(1 + \eta(x)) + K_\varepsilon(1 + g(x))(1 - \eta(x))\big\}\,P_X(dx) - R_{\Phi_H}[f].$$

Although Assumption Q2 is not fulfilled, the next proposition shows that it is possible to adapt the argument of Proposition 5 to the hinge loss $\Phi = \Phi_H$. However, unlike Proposition 5 where the rate of convergence is of the order $1/n$ (up to a logarithmic factor), the resulting sparsity oracle inequality has only the rate $1/\sqrt{n}$ (up to a logarithmic factor), cf. also Remark 10(1) below. This is the best we can get for the hinge loss without imposing any condition on $\eta$.
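The two properties of the smoothing function used above are easy to check numerically. The sketch below assumes $K_\varepsilon(z) = (\sqrt{\varepsilon^2 + z^2} - \varepsilon)\mathbb{1}(z > 0)$ and $K_\varepsilon''(z) = \varepsilon^2(\varepsilon^2 + z^2)^{-3/2}$ for $z > 0$; the function names and the tolerance are illustrative choices.

```python
import math


def smoothed_positive_part(z, eps):
    """K_eps(z) = (sqrt(eps^2 + z^2) - eps) * 1(z > 0)."""
    return math.sqrt(eps * eps + z * z) - eps if z > 0 else 0.0


def second_derivative(z, eps):
    """Closed form of K_eps''(z), valid for z > 0."""
    return eps ** 2 / (eps ** 2 + z ** 2) ** 1.5


def sandwich_holds(eps, z_values, tol=1e-12):
    """Check K_eps(z) <= max(z, 0) <= K_eps(z) + eps up to rounding error."""
    return all(
        smoothed_positive_part(z, eps) <= max(z, 0.0) + tol
        and max(z, 0.0) <= smoothed_positive_part(z, eps) + eps + tol
        for z in z_values
    )


if __name__ == "__main__":
    grid = [k / 100.0 for k in range(-200, 201)]
    for eps in (0.01, 0.1, 1.0):
        print(eps, sandwich_holds(eps, grid))
```

The uniform approximation error is at most $\varepsilon$, while the curvature is at most $\varepsilon^{-1}$; this trade-off is what is later optimized over $\varepsilon$ in the proof of Proposition 6.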

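Proposition 6. Let $\Phi_H(u) = \max(1+u,0)$ be the hinge loss and $\max_{j=1,\dots,M}\|\phi_j\|_{P_X,\infty} \le L_\Phi$ for some $L_\Phi > 0$. Then, for every $\beta > 0$ the MA aggregate $\hat f_n$ based on the sparsity prior satisfies

$$E_f\big[\ell_{\Phi_H}(\hat f_n, f)\big] \le \min_{\|\lambda^*\|_1 \le R - 2M\tau}\Big(\ell_{\Phi_H}(f_{\lambda^*}, f) + \frac{4\beta}{n+1}\sum_{j=1}^{M}\log(1 + \tau^{-1}|\lambda_j^*|)\Big) + \frac{2(1 + RL_\Phi)^2}{\beta}\,e^{(1+RL_\Phi)/\beta} + R(M,\tau),$$

where $R(M,\tau) = 4\tau L_\Phi\sqrt{M} + \beta(n+1)^{-1}$.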

The proof of this proposition is given in the appendix. 
Remark 10. 

(1) Consider the sparsity scenario, i.e., assume that for some vector $\lambda^*$ having at most $M^*$ non-zero coordinates, the excess risk $\ell_{\Phi_H}(f_{\lambda^*}, f)$ is small and $\|\lambda^*\|_1 \le R/2$. Proposition 6 with the choice of $\beta = (1 + RL_\Phi)\sqrt{n/M^*}$ and $\tau = \min\big(\frac{1}{\sqrt{Mn}}, \frac{R}{4M}\big)$ leads to the sparsity oracle inequality

$$E_f\big[\ell_{\Phi_H}(\hat f_n, f)\big] \le \min_{\|\lambda^*\|_0 \le M^*}\Big[\ell_{\Phi_H}(f_{\lambda^*}, f) + \frac{(1 + RL_\Phi)\sqrt{M^*}}{\sqrt{n}}\big\{C + 4\log(1 + \tau^{-1}\|\lambda^*\|_1)\big\}\Big],$$

where $C > 0$ is a constant independent of $M$, $M^*$ and $n$ if $M^* \le n$. This result is valid for arbitrary $n$. It should be noted that the MA aggregate $\hat f_n$ satisfying this SOI depends on the upper bound $M^*$ on the sparsity level, which is not always available in practice. Constructing a classifier independent of $M^*$ and satisfying the above SOI is an interesting open problem.

(2) An important special case is a dictionary composed of a large number of simple binary classifiers $\phi_j : \mathcal{X} \to \{\pm 1\}$. If we choose $R = 1$, all aggregates $f_\lambda$ with $\|\lambda\|_1 \le R$, as well as their mixtures, take values in $[-1,1]$, and therefore the function $Q(z, f_\lambda)$ associated with the hinge loss is linear in $\lambda$. This property has two important consequences. The first one is that Assumption L holds with $\mathcal{M} = 0$ and it is no longer necessary to smooth out the function $L_f(\lambda)$ and to use Remark 3 in the proof of Proposition 6. Thus, the residual term $R(M,\tau)$ is equal to $\beta(n+1)^{-1}$. The second consequence is computational, related to the Langevin Monte-Carlo approximation of the MA aggregate briefly described in Section 8.2 below. Namely, in this case we have strong mixing properties that are independent of the ambient dimension $M$, due to the independence of the coordinates of the Langevin diffusion.

(3) According to [36], if the underlying distribution $P$ satisfies the margin assumption of [48], then the rate of aggregation can be substantially improved. It would be interesting to investigate whether this property extends to the sparsity scenario. It is likely that one of the randomized procedures of [?] used in conjunction with our sparsity prior can yield an aggregation rate optimal classifier.



8. Discussion 

8.1. Comparison with other methods of sparse estimation. In this paper we have proved sparsity oracle inequalities (SOI) in a setting which is important but not much studied in the literature on sparsity. We considered i.i.d. random sampling and we measured the quality of estimation/prediction by the average loss with respect to the distribution of $Z = (X,Y)$, namely, our main example was the loss $\ell(g,f) = \int_{\mathcal{Z}} Q(z,g)\,P_f(dz)$. Most of the literature on sparse estimation is focused on the high-dimensional linear regression model with fixed design, so the data are not i.i.d. and the empirical prediction loss, rather than the average loss, is considered. Notable exceptions are the papers [13], [33], [34], [35], [49] where the framework is similar to ours. Among these, [35] focuses on regression with random design and studies the Dantzig selector, while [13], [33], [34], [49] analyze the penalized estimators of the form

$$\hat\lambda_n = \arg\min_{\lambda \in \Lambda}\Big(\frac{1}{n}\sum_{i=1}^{n} Q(Z_i, f_\lambda) + \mathrm{Pen}(\lambda)\Big)$$



where $\mathrm{Pen}(\lambda)$ is a penalty, which is equal or close to the $\ell_1$-penalty $r\|\lambda\|_1$ with a suitable regularization parameter $r > 0$. For the penalized estimator $\hat f_n = f_{\hat\lambda_n}$ they prove SOI of the form (here we give a "generic" simplified version based on [33]):

$$\ell(\hat f_n, f) \le \min_{\|\lambda^*\|_1 \le R,\ \|\lambda^*\|_0 \le M^*}\Big[C\,\ell(f_{\lambda^*}, f) + \frac{C(1+R)M^*}{n\,\kappa_{n,M}}\,\mathcal{L}_{n,M}\Big] \tag{17}$$

with a probability close to 1, where $C > 0$ is a constant independent of $n$ and $M$, $\mathcal{L}_{n,M}$ is a factor which is logarithmic in $n$ and $M$, and $\kappa_{n,M}$ is the minimal sparse eigenvalue appearing in the conditions on the Gram matrix of the dictionary quoted in the Introduction. With the same notation, a "generic" version of our SOI for the MA aggregate $\hat f_n$ is the following:

$$E_f\big[\ell(\hat f_n, f)\big] \le \min_{\|\lambda^*\|_0 \le M^*}\Big[\ell(f_{\lambda^*}, f) + \frac{CM^*}{n}\,\mathcal{L}_{n,M}\Big]. \tag{18}$$

There are two advantages of (18) with respect to (17). First, (18) is a sharp oracle inequality, since the leading constant is 1, whereas this is not the case in (17). Second and most important, (18) holds under mild assumptions on the dictionary, such as the boundedness of the functions $\phi_j$ in some norm, whereas (17) requires restrictive assumptions on the minimal sparse eigenvalue $\kappa_{n,M}$, which can be very small and appears in the denominator. In particular, (18) is applicable when $\kappa_{n,M} = 0$. Finally we note that (17) is an oracle inequality "in probability" while (18) is "in expectation". Inequalities in expectation can be derived from the inequalities in probability of the form (17) obtained in [13], [33], [34], [49] only under some additional assumptions. So, strictly speaking, even more assumptions should be imposed in the case of (17) to make a direct comparison with (18) possible.

In conclusion, we see that the oracle bounds for $\ell_1$-penalized methods, such as the Lasso or its modifications, can be quite inaccurate as compared to those that we obtain for the MA aggregate.

The $\ell_0$-penalized methods for models with i.i.d. data are less studied. To our knowledge, this is done only for regression with random design [10] and for density estimation [40]. The oracle inequalities in those papers are less accurate than ours since the leading constant there is greater than 1. Moreover, if we want to make it closer to 1, the remainder term of the oracle inequalities explodes.

Furthermore, as mentioned above, our sparsity oracle inequalities are potentially applicable not only to the MA aggregate, but to any estimator associated with a prior distribution $\pi$ and satisfying a PAC-Bayesian bound in expectation as in Theorem 1.

8.2. Computational aspects. If the dimension $M$ is large, the computation of the MA aggregate with the sparsity prior becomes a hard problem. Indeed, its definition contains integrals over a simplex in $\mathbb{R}^M$. Nevertheless, accurate approximations can be realized by a numerically efficient algorithm based on Langevin Monte-Carlo. This algorithm, along with the convergence and simulation studies, is discussed in [22], [23]. Here we only sketch some main ideas underlying the numerical procedure. For simplicity, we consider the case of linear regression (cf. Subsection 5.2). The argument is easily extended to other models discussed in the previous sections.

Thus, assume that we have a sample $(X_i, Y_i)$, $i = 1,\dots,n$, and a finite dictionary $\{\phi_j : \mathcal{X} \to \mathbb{R}\}$ of cardinality $M$. We wish to compute the expression

$$\hat\lambda = \frac{\int_{\mathbb{R}^M}\lambda\,e^{-\beta^{-1}\|Y - F_\lambda(X)\|_2^2}\,\pi(d\lambda)}{\int_{\mathbb{R}^M} e^{-\beta^{-1}\|Y - F_\lambda(X)\|_2^2}\,\pi(d\lambda)}, \tag{19}$$

where $F_\lambda(X) = (f_\lambda(X_1),\dots,f_\lambda(X_n))^\top$ and $f_\lambda = \sum_{j=1}^{M}\lambda_j\phi_j$. A slight modification of the sparsity prior consists in replacing $\pi$ by

$$\pi(d\lambda) \propto \Big\{\prod_{j=1}^{M}\frac{e^{-\omega(\alpha\lambda_j)}}{(\tau^2 + \lambda_j^2)^2}\Big\}\,\mathbb{1}(\|\lambda\|_1 \le R)\,d\lambda, \tag{20}$$

where $\alpha$ is a small parameter and $\omega : \mathbb{R} \to \mathbb{R}$ is the Huber function: $\omega(t) = t^2\,\mathbb{1}(|t| \le 1) + (2|t| - 1)\,\mathbb{1}(|t| > 1)$. Introducing the product of $e^{-\omega(\alpha\lambda_j)}$ in the definition of the prior does not affect its capacity to capture sparse objects, in the sense that the MA aggregate based on the prior (20) can be shown to satisfy a SOI which is quite similar to that of Theorem 2 (cf. [22], [23], where the regression model with fixed design is treated). On the other hand, this modification of the sparsity prior makes it possible to rigorously prove the geometric ergodicity of the Langevin diffusion defined below.



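Note that we can equivalently write $\hat\lambda$ in the form

$$\hat\lambda = \frac{\int_{\mathbb{R}^M}\lambda\,\mathbb{1}(\|\lambda\|_1 \le R)\,p_V(\lambda)\,d\lambda}{\int_{\mathbb{R}^M}\mathbb{1}(\|\lambda\|_1 \le R)\,p_V(\lambda)\,d\lambda}, \tag{21}$$

where $p_V(\lambda) \propto e^{V(\lambda)}$ with

$$V(\lambda) = -\beta^{-1}\|Y - F_\lambda(X)\|_2^2 - \sum_{j=1}^{M}\big\{2\log(\tau^2 + \lambda_j^2) + \omega(\alpha\lambda_j)\big\}. \tag{22}$$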

Consider now the Langevin stochastic differential equation (SDE)

$$dL_t = \nabla V(L_t)\,dt + \sqrt{2}\,dW_t, \qquad L_0 = 0, \quad t \ge 0,$$

where $W$ stands for an $M$-dimensional Brownian motion. For our choice of the potential $V$ this SDE has a unique strong solution. It can also be shown (cf. [22], [23]) that this choice of $V$ guarantees the geometric ergodicity of the solution, which implies that its stationary distribution has the density $p_V(\lambda) \propto e^{V(\lambda)}$, $\lambda \in \mathbb{R}^M$. This and (21) suggest the Langevin Monte Carlo procedure of computation of $\hat\lambda$. Indeed, consider the time averages

$$\bar L_T = \frac{1}{T}\int_0^T L_t\,\mathbb{1}(\|L_t\|_1 \le R)\,dt, \qquad S_T = \frac{1}{T}\int_0^T \mathbb{1}(\|L_t\|_1 \le R)\,dt, \qquad T > 0.$$

According to the above remarks, the ratio of these average values converges, as $T \to \infty$, to the vector $\hat\lambda$ that we want to compute. Note that $\bar L_T$ and $S_T$ are one-dimensional integrals over a finite interval and, therefore, are simpler objects than $\hat\lambda$, which is an integral in $M$ dimensions. Still, one cannot compute $\bar L_T$ directly, and some discretization is needed. A standard way of doing it is to approximate $\bar L_T$ and $S_T$ by the sums

$$\bar L^E_{T,h} = \frac{1}{[T/h]}\sum_{k=0}^{[T/h]-1} L^E_k\,\mathbb{1}(\|L^E_k\|_1 \le R), \qquad S^E_{T,h} = \frac{1}{[T/h]}\sum_{k=0}^{[T/h]-1}\mathbb{1}(\|L^E_k\|_1 \le R),$$

where $\{L^E_k\}$ is the Markov chain defined by the Euler scheme

$$L^E_{k+1} = L^E_k + h\nabla V(L^E_k) + \sqrt{2h}\,W_k, \qquad L^E_0 = 0, \quad k = 0, 1, \dots, [T/h] - 1.$$

Here $W_1, W_2, \dots$ are i.i.d. standard Gaussian random vectors in $\mathbb{R}^M$, $h > 0$ is a step of discretization, and $[x]$ stands for the integer part of $x \in \mathbb{R}$. It can be shown that $\bar L^E_{T,h}$ is an accurate approximation of $\bar L_T$ for small $h$. We refer to [22], [23] for further details. The computational complexity is polynomial in $M$ and $n$. Simulation results in [22, 23], as well as the experiments on image denoising [?], show the fast convergence of the algorithm; it can be easily realized in dimensions $M$ up to several thousands. They also demonstrate nice performance of the exponentially weighted aggregate as compared with the Lasso and other related methods of prediction under the sparsity scenario.
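The Euler scheme described above can be sketched in a few lines. The following is a minimal illustration, not the implementation studied in the cited works: it assumes the potential $V(\lambda) = -\beta^{-1}\|Y - X\lambda\|_2^2 - \sum_j\{2\log(\tau^2 + \lambda_j^2) + \omega(\alpha\lambda_j)\}$ with the identity dictionary $f_\lambda(x) = x^\top\lambda$, and all function names, parameter values and the toy data are hypothetical.

```python
import math
import random


def huber_derivative(t):
    """Derivative of omega(t) = t^2 for |t| <= 1 and 2|t| - 1 otherwise."""
    return 2.0 * t if abs(t) <= 1.0 else 2.0 * math.copysign(1.0, t)


def grad_potential(lam, X, Y, beta, tau, alpha):
    """Gradient of V(lam) for the linear model f_lam(x) = <x, lam>."""
    n, M = len(X), len(lam)
    residuals = [Y[i] - sum(X[i][j] * lam[j] for j in range(M)) for i in range(n)]
    grad = []
    for j in range(M):
        data_term = (2.0 / beta) * sum(X[i][j] * residuals[i] for i in range(n))
        prior_term = (-4.0 * lam[j] / (tau ** 2 + lam[j] ** 2)
                      - alpha * huber_derivative(alpha * lam[j]))
        grad.append(data_term + prior_term)
    return grad


def langevin_aggregate(X, Y, beta, tau, alpha, R, h, T, seed=0):
    """Ratio of the discretized averages: mean of chain states inside the l1 ball."""
    rng = random.Random(seed)
    M = len(X[0])
    state = [0.0] * M                      # L_0 = 0
    running_sum = [0.0] * M
    inside_count = 0
    for _ in range(int(T / h)):
        if sum(abs(v) for v in state) <= R:
            inside_count += 1
            running_sum = [running_sum[j] + state[j] for j in range(M)]
        grad = grad_potential(state, X, Y, beta, tau, alpha)
        noise_scale = math.sqrt(2.0 * h)
        state = [state[j] + h * grad[j] + noise_scale * rng.gauss(0.0, 1.0)
                 for j in range(M)]
    return [v / inside_count for v in running_sum]


if __name__ == "__main__":
    X = [[1.0, -0.5], [0.3, 0.8], [-0.7, 0.2], [0.9, 0.9], [-0.2, -1.0]]
    Y = [0.5 * row[0] for row in X]        # toy data: only the first coordinate matters
    print(langevin_aggregate(X, Y, beta=4.0, tau=0.5, alpha=0.1, R=1.0, h=0.01, T=20.0))
```

Since the chain starts at the origin, at least one state lies in the $\ell_1$ ball, so the ratio is well defined; by convexity of the ball the returned vector also satisfies $\|\cdot\|_1 \le R$. The step $h$ has to be small relative to the curvature of $V$, which is of order $\tau^{-2}$ near zero.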
9. Appendix

9.1. Proof of Theorem 1. First, note that without loss of generality we can set $\beta = 1$. If this is not the case, it suffices to replace $Q$ and $\ell$ by $\tilde Q = Q/\beta$ and $\tilde\ell = \ell/\beta$, respectively. By Assumption Q1,

$$E_f\big[\ell(\hat f_n, f)\big] = \int_{\mathcal{Z}} E_f\big[Q(z, \hat f_n)\big]\,P_f(dz) - A(f). \tag{23}$$

In the last display we have used Fubini's theorem to interchange the integral and the expectation; this is possible since the integrand is bounded from below. To get the desired result, one needs now to bound the first term on the RHS of (23), which we rewrite as follows



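$$\int_{\mathcal{Z}} E_f\big[Q(z, \hat f_n)\big]\,P_f(dz) = -\int_{\mathcal{Z}} E_f\big[\log\big(\exp\{-Q(z, \hat f_n)\}\big)\big]\,P_f(dz). \tag{24}$$

Recall now that $\hat f_n$ is defined as the average of the functions $f_\lambda$ w.r.t. the probability measure $\hat\mu_n$. If we knew that the mapping $g \mapsto \exp\{-Q(z,g)\}$ is concave on the convex hull of $\mathcal{F}_\Lambda$ we could apply Jensen's inequality to get

$$\exp\{-Q(z, \hat f_n)\} \ge \int_\Lambda \exp\{-Q(z, f_\lambda)\}\,\hat\mu_n(d\lambda).$$

As we see below, this would allow us to get inequality (2) by a simple application of the convex duality argument. Unfortunately, the above mentioned concavity property is rather exceptional and therefore the quantity

$$S_1(z, \mathbf{Z}) = \log\Big(\int_\Lambda \exp\{-Q(z, f_\lambda)\}\,\hat\mu_n(d\lambda)\Big) - \log\big(\exp\{-Q(z, \hat f_n)\}\big)$$

is not necessarily a.s. negative. However, we may write

$$\int_{\mathcal{Z}} E_f\big[\log\big(e^{-Q(z,\hat f_n)}\big)\big]\,P_f(dz) = \int_{\mathcal{Z}} E_f\big[S_0(z, \mathbf{Z}) - S_1(z, \mathbf{Z})\big]\,P_f(dz) \tag{25}$$

where

$$S_0(z, \mathbf{Z}) = \log\Big(\int_\Lambda \exp\{-Q(z, f_\lambda)\}\,\hat\mu_n(d\lambda)\Big).$$

By the concavity of the logarithm,

$$S_0(z, \mathbf{Z}) \ge \frac{1}{n+1}\sum_{m=0}^{n}\log\Big(\int_\Lambda e^{-Q(z, f_\lambda)}\,\hat\theta_{m,\lambda}\,\pi(d\lambda)\Big).$$

Replacing $\hat\theta_{m,\lambda}$ by its explicit expression and taking the integral of both sides of the last display, we get on the RHS a telescoping sum. This leads to the inequality

$$\int_{\mathcal{Z}} E_f\big[S_0(z, \mathbf{Z})\big]\,P_f(dz) \ge \frac{1}{n+1}\int_{\mathcal{Z}^{n+1}}\log\Big(\int_\Lambda e^{-\sum_{i=1}^{n+1} Q(z_i, f_\lambda)}\,\pi(d\lambda)\Big)\,P_f^{(n+1)}(d\mathbf{z}).$$

By a convex duality argument (cf., e.g., [25], p.264, or [3], p.160), we get

$$\log\Big(\int_\Lambda e^{-\sum_{i=1}^{n+1} Q(z_i, f_\lambda)}\,\pi(d\lambda)\Big) \ge -\sum_{i=1}^{n+1}\int_\Lambda Q(z_i, f_\lambda)\,p(d\lambda) - \mathcal{K}(p,\pi)$$

for every $p \in \mathcal{P}_\Lambda$. Therefore, integrating w.r.t. $z_1,\dots,z_{n+1}$ and using the symmetry, we get

$$\int_{\mathcal{Z}} E_f\big[S_0(z, \mathbf{Z})\big]\,P_f(dz) \ge -\int_{\mathcal{Z}}\int_\Lambda Q(z, f_\lambda)\,p(d\lambda)\,P_f(dz) - \frac{\mathcal{K}(p,\pi)}{n+1} = -\int_\Lambda \ell(f_\lambda, f)\,p(d\lambda) - A(f) - \frac{\mathcal{K}(p,\pi)}{n+1}.$$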
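This and equations (23)-(25) imply

$$E_f\big[\ell(\hat f_n, f)\big] \le \int_\Lambda \ell(f_\lambda, f)\,p(d\lambda) + \frac{\mathcal{K}(p,\pi)}{n+1} + \int_{\mathcal{Z}} E_f\big[S_1(z, \mathbf{Z})\big]\,P_f(dz). \tag{26}$$

Let us show that the last term on the RHS of (26) is non-positive. Rewrite $S_1(z, \mathbf{Z})$ in the form

$$S_1(z, \mathbf{Z}) = \log\int_\Lambda \exp\big(-\{Q(z, f_\lambda) - Q(z, \hat f_n)\}\big)\,\hat\mu_n(d\lambda).$$

By the Fubini theorem, the concavity of the logarithm and Assumption Q2, we get

$$\int_{\mathcal{Z}} E_f\big[S_1(z, \mathbf{Z})\big]\,P_f(dz) \le E_f\Big[\log\int_\Lambda \Psi_1(f_\lambda, \hat f_n)\,\hat\mu_n(d\lambda)\Big]$$

(recall that we set $\beta = 1$). The concavity of the map $g \mapsto \Psi_1(g, \hat f_n)$ and Jensen's inequality yield

$$\int_\Lambda \Psi_1(f_\lambda, \hat f_n)\,\hat\mu_n(d\lambda) \le \Psi_1\Big(\int_\Lambda f_\lambda\,\hat\mu_n(d\lambda), \hat f_n\Big) = \Psi_1(\hat f_n, \hat f_n) = 1,$$

and the desired result follows.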

9.2. Some lemmas. We now give some technical results needed in the proofs. 

Lemma 1. For every $M \in \mathbb{N}$ and every $s > M$, the following inequality holds:

$$\frac{1}{(\pi/2)^M}\int_{\{u : \|u\|_1 > s\}}\prod_{j=1}^{M}\frac{du_j}{(1 + u_j^2)^2} \le \frac{M}{(s - M)^2}.$$

Proof. Let $U_1,\dots,U_M$ be i.i.d. random variables drawn from the scaled Student $t(3)$ distribution having as density the function $u \mapsto 2/[\pi(1 + u^2)^2]$. One easily checks that $E[U_1^2] = 1$. Furthermore, with this notation, we have

$$\frac{1}{(\pi/2)^M}\int_{\{u : \|u\|_1 > s\}}\prod_{j=1}^{M}\frac{du_j}{(1 + u_j^2)^2} = P\Big(\sum_{j=1}^{M}|U_j| \ge s\Big).$$

In view of Chebyshev's inequality the last probability can be bounded as follows:

$$P\Big(\sum_{j=1}^{M}|U_j| \ge s\Big) \le \frac{M\,E[U_1^2]}{(s - M\,E[|U_1|])^2} \le \frac{M}{(s - M)^2},$$

and the desired inequality follows. □
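The moments used in the proof of Lemma 1 can be confirmed by numerical quadrature. The sketch below assumes the density $u \mapsto 2/[\pi(1 + u^2)^2]$ and uses the substitution $u = \tan\theta$, under which $E[g(U)] = \frac{2}{\pi}\int_{-\pi/2}^{\pi/2} g(\tan\theta)\cos^2\theta\,d\theta$; the function names and the number of quadrature nodes are illustrative.

```python
import math


def expectation(g, steps=20000):
    """Midpoint rule for E[g(U)], U with density 2 / (pi * (1 + u^2)^2)."""
    width = math.pi / steps
    total = 0.0
    for k in range(steps):
        theta = -math.pi / 2.0 + (k + 0.5) * width
        total += g(math.tan(theta)) * math.cos(theta) ** 2
    return (2.0 / math.pi) * total * width


def chebyshev_tail_bound(M, s):
    """Upper bound M / (s - M)^2 of Lemma 1, valid for s > M."""
    return M / (s - M) ** 2


if __name__ == "__main__":
    print(expectation(lambda u: 1.0))      # total mass, close to 1
    print(expectation(lambda u: u * u))    # second moment, close to 1
    print(expectation(abs))                # first absolute moment, close to 2 / pi
    print(chebyshev_tail_bound(10, 20))    # equals 1 / M when s = 2 M
```

With $s = 2M$ the bound equals $1/M$, which is the value used in the proof of Lemma 2 to control the normalizing constant.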

Lemma 2. Let the assumptions of Theorem 2 be satisfied and let $p_0$ be the probability measure defined by (30). If $M \ge 2$ then $\int_\Lambda(\lambda_1 - \lambda_1^*)^2\,p_0(d\lambda) \le 4\tau^2$.

Proof. Using the change of variables $u = (\lambda - \lambda^*)/\tau$ we write

$$\int_\Lambda(\lambda_1 - \lambda_1^*)^2\,p_0(d\lambda) = C_M\tau^2\int_{B_1(2M)} u_1^2\prod_{j=1}^{M}(1 + u_j^2)^{-2}\,du$$

with

$$C_M = \Big(\int_{B_1(2M)}\prod_{j=1}^{M}(1 + u_j^2)^{-2}\,du\Big)^{-1}, \tag{27}$$

where $u_j$ are the components of $u$. Extending the integration from $B_1(2M)$ to $\mathbb{R}^M$ and using the inequality $\int_{\mathbb{R}} u_1^2(1 + u_1^2)^{-2}\,du_1 \le \pi$, we get

$$\int_\Lambda(\lambda_1 - \lambda_1^*)^2\,p_0(d\lambda) \le C_M\tau^2\pi\Big(\int_{\mathbb{R}}(1 + t^2)^{-2}\,dt\Big)^{M-1} = 2C_M\tau^2(\pi/2)^M,$$

where we used that the primitive of the function $(1 + x^2)^{-2}$ is $\frac{1}{2}\arctan(x) + \frac{x}{2(1 + x^2)}$. To bound $C_M$ we apply Lemma 1, which yields

$$C_M \le (2/\pi)^M(1 - 1/M)^{-1} \le 2(2/\pi)^M \tag{28}$$

for $M \ge 2$. Combining these estimates we get $\int_\Lambda(\lambda_1 - \lambda_1^*)^2\,p_0(d\lambda) \le 4\tau^2$ and the desired inequality follows. □

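Lemma 3. Let the assumptions of Theorem 2 be satisfied and let $p_0$ be the probability measure defined by (30). Then $\mathcal{K}(p_0,\pi) \le 4\sum_{j=1}^{M}\log(1 + |\lambda_j^*|/\tau) + 1$.

Proof. The definition of $\pi$, $p_0$ and of the Kullback-Leibler divergence imply that

$$\mathcal{K}(p_0,\pi) = \int_{B_1(2M\tau)}\log\Big\{\tau^{3M}C_M C_{\tau,R}\prod_{j=1}^{M}\frac{(\tau^2 + \lambda_j^2)^2}{(\tau^2 + (\lambda_j - \lambda_j^*)^2)^2}\Big\}\,p_0(d\lambda)$$
$$= \log(\tau^{3M}C_M C_{\tau,R}) + 2\sum_{j=1}^{M}\int_{B_1(2M\tau)}\log\Big\{\frac{\tau^2 + \lambda_j^2}{\tau^2 + (\lambda_j - \lambda_j^*)^2}\Big\}\,p_0(d\lambda). \tag{29}$$

We now successively evaluate the terms on the RHS of (29). First, by the definition of the normalizing constant $C_{\tau,R}$ of the sparsity prior, we have

$$C_{\tau,R} = \tau^{-3M}\int_{B_1(R/\tau)}\prod_{j=1}^{M}\frac{du_j}{(1 + u_j^2)^2} \le \tau^{-3M}\Big(\int_{\mathbb{R}}(1 + u_1^2)^{-2}\,du_1\Big)^M = \tau^{-3M}(\pi/2)^M.$$

This and (28) imply $\log(\tau^{3M}C_M C_{\tau,R}) \le \log 2 \le 1$.

To evaluate the second term on the RHS of (29) we use that

$$\frac{\tau^2 + \lambda_j^2}{\tau^2 + (\lambda_j - \lambda_j^*)^2} = 1 + \frac{2\tau(\lambda_j - \lambda_j^*)}{\tau^2 + (\lambda_j - \lambda_j^*)^2}\,(\lambda_j^*/\tau) + \frac{\lambda_j^{*2}}{\tau^2 + (\lambda_j - \lambda_j^*)^2} \le 1 + |\lambda_j^*/\tau| + (\lambda_j^*/\tau)^2 \le (1 + |\lambda_j^*/\tau|)^2.$$

This entails that the second term on the RHS of (29) is bounded from above by $4\sum_{j=1}^{M}\log(1 + |\lambda_j^*|/\tau)$. Combining these inequalities we get the lemma. □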

9.3. Proof of Theorem 2. In view of inequality (2), we have

$$E_f\big[\ell(\hat f_n, f)\big] \le \int_\Lambda \ell(f_\lambda, f)\,p(d\lambda) + \frac{\beta\,\mathcal{K}(p,\pi)}{n+1}$$

for every probability measure $p$. We choose here $p = p_0$ where $p_0$ has the following Lebesgue density:

$$\frac{dp_0}{d\lambda}(\lambda) \propto \frac{d\pi}{d\lambda}(\lambda - \lambda^*)\,\mathbb{1}_{B_1(2M\tau)}(\lambda - \lambda^*). \tag{30}$$

Here the sign $\propto$ indicates the proportionality of two functions. Since $\|\lambda^*\|_1 \le R - 2M\tau$, the condition $\lambda - \lambda^* \in B_1(2M\tau)$ implies that $\lambda \in B_1(R)$ and, therefore, $p_0$ is absolutely continuous w.r.t. the sparsity prior $\pi$. By Taylor's formula and Assumption L we have

$$\ell(f_\lambda, f) = L_f(\lambda) \le L_f(\lambda^*) + \nabla L_f(\lambda^*)^\top(\lambda - \lambda^*) + (\lambda - \lambda^*)^\top\mathcal{M}(\lambda - \lambda^*), \qquad \forall\lambda \in \Lambda.$$

Integrating both sides of this inequality w.r.t. $p_0$ and using the fact that the density of $p_0$ is symmetric about $\lambda^*$ and invariant under permutation of the components we find

$$\int_\Lambda \ell(f_\lambda, f)\,p_0(d\lambda) \le L_f(\lambda^*) + \mathrm{Tr}(\mathcal{M})\int_\Lambda(\lambda_1 - \lambda_1^*)^2\,p_0(d\lambda). \tag{31}$$

Combining this inequality with those stated in Lemmas 2 and 3 we get the desired result.

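9.4. Proof of Proposition 4. Note that Assumption Q1 obviously holds and Assumption L is fulfilled with $\mathcal{M}$ being the Gram matrix. The diagonal entries of $\mathcal{M}$ are equal to one since $\|\phi_j\|_{\mu,2} = 1$, and therefore we have $\mathrm{Tr}(\mathcal{M}) = M$.

It remains to check Assumption Q2 in order to apply Theorem 2. Introduce the function

$$\Xi(t) = \exp\big(-\beta^{-1}\{Q(X_1, g_0 + t(g_1 - g_0)) - Q(X_1, g)\}\big) = \exp\big[-\beta^{-1}\{\|g_t\|_{\mu,2}^2 - \|g\|_{\mu,2}^2 + 2(g(X_1) - g_t(X_1))\}\big], \qquad t \in [0,1],$$

where $g_0$, $g_1$ and $g$ are functions from the convex set $\mathcal{F}_\Lambda$, and $g_t = g_0 + t(g_1 - g_0) \in \mathcal{F}_\Lambda$. It is not hard to see that Assumption Q2 follows from the fact that the mapping $t \mapsto E_f[\Xi(t)]$ is concave for any triplet $g_0, g_1, g \in \mathcal{F}_\Lambda$. Let us prove now this concavity property. Since the functions $g_0, g_1, g$ are uniformly bounded we get that $\Xi(\cdot)$ is twice continuously differentiable and the differentiation inside the expectation $E_f[\Xi(t)]$ is legitimate. Therefore,

$$\frac{d}{dt}E_f[\Xi(t)] = -2\beta^{-1}E_f\big[(\langle g_t, h\rangle - h(X_1))\,\Xi(t)\big],$$
$$\frac{d^2}{dt^2}E_f[\Xi(t)] = -2\beta^{-2}E_f\big[\big(\beta\|h\|_{\mu,2}^2 - 2\{\langle g_t, h\rangle - h(X_1)\}^2\big)\,\Xi(t)\big],$$

where $h = g_1 - g_0$, and

$$\frac{\beta^2}{2}\frac{d^2}{dt^2}E_f[\Xi(t)] \le -\big(\beta\|h\|_{\mu,2}^2 - 2\langle g_t, h\rangle^2\big)E_f[\Xi(t)] + 2E_f\big[\{h(X_1)^2 - 2\langle g_t, h\rangle h(X_1)\}\,\Xi(t)\big].$$

Since every function in $\mathcal{F}_\Lambda$ is bounded in absolute value by $RL$, we also have

$$\Xi(t) \le \exp\big[-\beta^{-1}\{\|g_t\|_{\mu,2}^2 - \|g\|_{\mu,2}^2\} + 4RL/\beta\big] =: \Xi_1(t)$$

and, by Jensen's inequality together with $E_f[|g(X_1)|] \le \|g\|_{\mu,2}\|f\|_{\mu,2} \le R\sqrt{L}$,

$$E_f[\Xi(t)] \ge \exp\Big[-\beta^{-1}\Big\{\|g_t\|_{\mu,2}^2 - \|g\|_{\mu,2}^2 + 4\max_{g \in \mathcal{F}_\Lambda}E_f[|g(X_1)|]\Big\}\Big] \ge \Xi_1(t)\,e^{-4R(L+\sqrt{L})/\beta}.$$

Combining these estimates with the inequalities

$$E[h(X_1)^2] \le L\|h\|_{\mu,2}^2, \qquad |\langle g_t, h\rangle| \le \|g_t\|_{\mu,2}\|h\|_{\mu,2} \le R\|h\|_{\mu,2}, \qquad E[|\langle g_t, h\rangle h(X_1)|] \le RL\|h\|_{\mu,2}^2,$$

we get

$$\frac{\beta^2}{2}\frac{d^2}{dt^2}E_f[\Xi(t)] \le -\|h\|_{\mu,2}^2\,\Xi_1(t)\big((\beta - 2R^2)\,e^{-4R(L+\sqrt{L})/\beta} - 2L - 4RL\big) \le 0,$$

whenever $(\beta - 2R^2)\,e^{-4R(L+\sqrt{L})/\beta} \ge 2L + 4RL$. This proves the concavity of $t \mapsto E_f[\Xi(t)]$, and thus the proposition.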



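9.5. Proof of Proposition 6. In view of (26), for any prior $\pi$ and any $\beta > 0$ the MA aggregate $\hat f_n$ satisfies the inequality

$$E_f\big[\ell_{\Phi_H}(\hat f_n, f)\big] \le \inf_{p \in \mathcal{P}_\Lambda}\Big(\int_\Lambda \ell_{\Phi_H}(f_\lambda, f)\,p(d\lambda) + \frac{\beta\,\mathcal{K}(p,\pi)}{n+1}\Big) + \beta\int_{\mathcal{Z}} E_f\big[S_1(z, \mathbf{Z})\big]\,P_f(dz) \tag{32}$$

with $S_1(z, \mathbf{Z})$ defined by $S_1(z, \mathbf{Z}) = \log\int_\Lambda \exp\big(-\beta^{-1}\{Q(z, f_\lambda) - Q(z, \hat f_n)\}\big)\,\hat\mu_n(d\lambda)$. Let us introduce the function $\psi_\lambda(t) = \exp\big(-t\{Q(z, f_\lambda) - Q(z, \hat f_n)\}\big)$. This function is infinitely differentiable, equals one at the origin and we have $S_1(z, \mathbf{Z}) = \log\int_\Lambda \psi_\lambda(\beta^{-1})\,\hat\mu_n(d\lambda)$. Using the Taylor formula, we get

$$\psi_\lambda(t) \le 1 + t\,\psi_\lambda'(0) + \frac{t^2}{2}\big(Q(z, f_\lambda) - Q(z, \hat f_n)\big)^2 e^{tQ(z, \hat f_n)}, \qquad \forall t \ge 0.$$

Furthermore, since the hinge loss is convex, the Jensen inequality yields $\int_\Lambda \psi_\lambda'(0)\,\hat\mu_n(d\lambda) \le 0$. Replacing $t$ by $\beta^{-1}$ and using that $Q(z, \hat f_n) \le 1 + RL_\Phi$, we get the inequalities

$$S_1(z, \mathbf{Z}) = \log\int_\Lambda \psi_\lambda(\beta^{-1})\,\hat\mu_n(d\lambda) \le \log\Big(1 + \frac{e^{(1+RL_\Phi)/\beta}}{2\beta^2}\int_\Lambda\big(Q(z, f_\lambda) - Q(z, \hat f_n)\big)^2\,\hat\mu_n(d\lambda)\Big)$$
$$\le \frac{e^{(1+RL_\Phi)/\beta}}{2\beta^2}\int_\Lambda\big(Q(z, f_\lambda) - Q(z, \hat f_n)\big)^2\,\hat\mu_n(d\lambda) \le \frac{2(1 + RL_\Phi)^2}{\beta^2}\,e^{(1+RL_\Phi)/\beta}.$$

Thus we obtain

$$E_f\big[\ell_{\Phi_H}(\hat f_n, f)\big] \le \inf_{p \in \mathcal{P}_\Lambda}\Big(\int_\Lambda \ell_{\Phi_H}(f_\lambda, f)\,p(d\lambda) + \frac{\beta\,\mathcal{K}(p,\pi)}{n+1}\Big) + \frac{2(1 + RL_\Phi)^2}{\beta}\,e^{(1+RL_\Phi)/\beta}, \tag{33}$$

which is valid for any prior $\pi$. Note that the term with the infimum in (33) coincides with the right hand side of the oracle inequality of Theorem 1. Therefore, when the sparsity prior is used, this term can be bounded from above using Remark 3 with $\bar L_f(\lambda) = \int_{\mathcal{X}}|\eta(x)|\,K_\varepsilon(f_\lambda(x) - f(x))\,P_X(dx)$. Since also $|\eta(x)| \le 1$, we get

$$E_f\big[\ell_{\Phi_H}(\hat f_n, f)\big] \le \min_{\|\lambda^*\|_1 \le R - 2M\tau}\Big(\ell_{\Phi_H}(f_{\lambda^*}, f) + \frac{4\beta}{n+1}\sum_{j=1}^{M}\log(1 + \tau^{-1}|\lambda_j^*|)\Big) + \varepsilon + 4\tau^2\,\mathrm{Tr}(\mathcal{M}_\varepsilon) + \frac{\beta}{n+1} + \frac{2(1 + RL_\Phi)^2}{\beta}\,e^{(1+RL_\Phi)/\beta},$$

where the entries of the matrix $\mathcal{M}_\varepsilon$ are $\varepsilon^{-1}\int_{\mathcal{X}}|\eta(x)|\,\phi_j(x)\phi_{j'}(x)\,P_X(dx)$ with $j, j' = 1,\dots,M$. Thus, $\mathrm{Tr}(\mathcal{M}_\varepsilon) \le L_\Phi^2 M\varepsilon^{-1}$, and we get the result of the proposition by minimizing the right hand side of the last display with respect to $\varepsilon > 0$.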